
September 2018

Whois studies: it's time to ask the right questions

I remain skeptical of all the Whois studies that I’ve reviewed (FTC, SSAC, ICANN), including studies where I was a party to the research. I’ll apologize for failing to contribute to a satisfactory Whois study. I’ll also admit that my understanding of how to study a problem scientifically has greatly expanded over the past ten years.

A truly scientific Whois study must meet certain common criteria. The purpose should be clearly defined; in particular, the researchers or parties who commission the research should make certain that they are asking the right question.

Before I raise anyone’s brows or blood pressure, I’m not suggesting that these studies lacked rigor.

The researchers were asked to answer a question and they did.

I believe that the wrong question was studied.

The moment you ask, “Does public access to Whois-published data lead to a measurable degree of misuse?” or “Is the WHOIS Service a Source for email Addresses for Spammers?”, you’ve constrained the research scope too narrowly. You’ve also set up a strawman. There’s a difference between asking the wrong question and asking the right question but in the wrong way.

These studies had different findings. They appear to have been conducted scientifically. However, the Whois-specific findings and conclusions do not provide a satisfactory comparative analysis of Whois personal data harvesting versus other methods of collecting personal data; in particular, they do not help us understand:

How much of a threat is personal data harvesting using Whois relative to harvesting using other sources?

Why is this the relevant question? Databases that contain personal data are exposed to multiple threats. To better understand the personal data harvesting issue, we need to consider not one but many of the known personal data harvesting methods that are used by criminals or misused commercially. The study should compare a known set of methods of personal data collection and rank Whois among these. The list of methods that cybercriminals or commercial interests use to collect personal data might include:

  • Database breaches, e.g., an attack resulting in the theft or disclosure of tax records or filings, financial accounts, credit card accounts, merchant accounts, bank logins, corporate (HR, ERM, CRM) or government user accounts.
  • Crawling and Search, e.g., visiting hundreds of thousands of web sites and extracting email addresses or other personal data from the pages visited, and employing sophisticated or advanced searches across billions of publicly accessible pages that search engines index.
  • Traffic or message capture, e.g., capture of personal data from unencrypted or compromised web (http) sessions, email transmissions or file transfers, or the exfiltration of data by malware through covert channels.
  • Email user enumeration attacks, e.g., brute force or “dictionary attacks” that harvest active accounts, and which are succeeded by password cracking attacks against enumerated users.
  • Malware, the collection of personal data from contacts files by malware that has infected a computer or mobile device, and Ransomware, the extortion of financial account information or payments through malware that has infected a computer or mobile device.
  • Whois queries, the collection of point of contact information using the public Whois service.
  • Social Engineering Attacks including Phishing, the luring of victims to pages that impersonate financial or corporate login pages, where the victim submits personal or financial account information; Doxing, the social engineering and harvesting of personal information, especially from social media sites; and Business Email Compromise (CEO Fraud), the use of social engineering and email account impersonation to acquire financial data or to hijack or perform fraudulent wire transfers.
  • Disclosure by third parties that aggregate and sell or share personal data, without notice or consent.

The ICANN study only asked about a single threat: Whois. This gives us the same kind of insight as a study of cancer in which the question posed is “is X a cause of cancer?” The X study may give us an answer, but it will not be a very comprehensive one. Questions such as “what are the possible causes of cancer, which are most dangerous (aggressive), and which are the causes that we are most likely to be exposed to?” provide more and deeper insights. For example, a study may find that eating red meat causes cancer, and another study may find that extended exposure to the sun does, too, but we want to understand which of these and other possible causes of cancer is the greater or greatest risk so that we can weigh the risks and, most importantly, focus our attention on mitigating the worst among the risks.

If a study were to be conducted using the threats I listed above, I’m willing to speculate that the findings would rank Whois fairly low, for the reasons explained in the following table:

| Threat to Personal Data | Risk | Popularity | Data Quality and Value | Reason for ranking | Difficulty |
| --- | --- | --- | --- | --- | --- |
| Database Breaches | Medium-High | High | High | Databases generally have complete and accurate personal data: contact data as well as account and verification (e.g., CCV) data. Value of | LOW. A wealth of exploit kits for common (SQL) vulnerabilities can be found in open repositories or on the Dark Web. |
| Malware, Ransomware | High | High | High | Lax security/encryption practices, vulnerable software, and poor configurations expose data to malware attacks and unintentional leaks (e.g., posts shared publicly rather than with friends only). | LOW. Malware and ransomware can be downloaded from open repositories, social media pages, or the Dark Web. Attack networks can be leased from spam infrastructures (e.g., Avalanche) or botnet herders. |
| Social Engineering Attacks: Doxing, BEC, Phishing, Advance Fee Frauds | High | High | High | Highly sophisticated attackers collect personal data from targeted individuals, businesses, or government agencies. | HIGH. Doxing and BEC in particular require direct or personal exchanges between attackers and targets, with typically lucrative returns. |
| Disclosure by 3rd parties | Medium-High | Medium-High | Varies | Notice of use or informed consent is poorly practiced, and oversight (until GDPR) is poor. Users make poor or uninformed choices based on cost or desirability of service. | VARIES. Collection is not difficult, but 3rd parties invest in measures to hide this activity from the public or regulators. |
| Crawling and Search | Medium | High | Varies | Any data that can be indexed can be found using advanced searches and disclosed or misused. Automation (scripts) can collect and extract email addresses or other personal data from millions of web pages with ease. | LOW. Automation can be downloaded from open repositories, social media pages, or web pages. |
| Traffic or Message Capture | Medium | Medium | Varies | Traffic capture today is sometimes state-actor sponsored and typically a patient and long-term extraction of intelligence. The completeness and accuracy of any personal data collected depends on the targeted environment. | MEDIUM. These attacks are malware-driven. Some of the malware is custom; some is available through open repositories or the Dark Web. |
| Email User Enumeration | Medium | Medium | Medium | These attacks are conducted to generate email address lists, to gather intelligence, or to acquire targets. Financial data and other personal data cannot be collected by this means. The lists are much less valued in underground marketplaces than database breach data. There is a market for such synthesized mailing lists, but the parties who purchase them often end up blocklisted because the sellers often collect spamtrap addresses. | LOW. Automation can be downloaded from open repositories, social media pages, or web pages. |
| Whois queries | Medium | Low | Poor | Whois point of contact information is often inaccurate. Whois does not contain financial data. The lists are much less valued in underground marketplaces than database breach data. | LOW. Automation can be downloaded from open repositories or web pages. Whois rate limiting impedes collection. |

The FTC study attempted a comparative harvesting analysis in 2002. This was the right context. The FTC asked the right question. Let’s repeat the study, with the caveat that the list of prominent attack vectors should be modernized to the list of today’s popular harvesting methods, and the procedures, measures, and data analysis should be reviewed to assure that the new study is scientific.

A scientific study of this breadth would allow us to assess the composite risk of private data disclosure as well as the individual risk through each of these collection methods. It would give us an opportunity to use data to drive informed consensus policy development for an important issue.


Post-GDPR WHOIS: A Myriad of Misconceptions, Misinformation and Misdirection

One of the most memorable lyrics of For What It’s Worth (Buffalo Springfield, 1967) aptly describes the current condition of the post-GDPR debate over domain registration data access:

 There’s battle lines being drawn… nobody’s right if everybody’s wrong. 

Cybersecurity and policy pundits are heatedly engaged over the impact of the EU General Data Protection Regulation (GDPR). Both sides have done a poor job of articulating the problem space, overlooking key aspects of the regulation and ICANN’s attempt to comply with GDPR in a Temporary Specification for Whois.

As difficult as it is to engage in this discussion dispassionately, it’s both necessary and urgent that we re-focus attention on the problem space.

I’ll begin by considering comments in a recent post, Special Interests Push US Congress to Override ICANN’s Whois Policy Process. I question several of the assertions made in this article:

The author asserts that publication of Whois point of contact data exacerbates spam. This remains a poorly studied theory with no definitive conclusion. Whois may be a source of email addresses for spammers, but the true question is whether it is the most popular, least expensive, and most effective means. Compared to crawling web pages or social media sites where users commonly publish email addresses, Whois is far too slow, and the number of email addresses that someone can conceivably scrape from Whois is small compared to what is available with less latency from the web. Spammers are also quite accomplished at synthesizing the user name part of an email address and flooding mail exchanges with permutations or combinations of addresses with near zero overhead. Given that there is no popularly accepted study or any study that satisfies academic rigor on which to conclude that spammers use Whois as a primary source for email address collection, this remains speculation.

The author asserts that “Registrars and Registries must still provide reasonable access to personal data to third parties with legitimate interests that are not overridden by privacy rights, such as law enforcement agencies pursuing criminals”. The author asserts this as fact, based perhaps on the temporary specification but evidently not on practice or recent experience. In practice, reasonable access by third parties is flawed in several important respects. There is presently no uniformity across registrars and registries with regard to interpretation and implementation of “Reasonable access”, “third parties”, and “legitimate interests”. The process for requesting access is not well publicized, and at least three frameworks for accreditation have been proposed. There is no defined ICANN compliance process to contest whether a registry or registrar has failed to meet the reasonable access obligation. Some are responsive. Some ignore any request. Some insist on court orders. When denied access in these circumstances, investigators cannot determine whether the domain registration has inaccuracies or fraudulently composed data. As email expert John Levine explains in his recent post, investigators can’t “find connections among domains (which tend to be registered with similar information, even if it's false) to take down a whole network of them at a time.” These same flaws affect trademark protection.

The author fails to discuss the most egregious problem with redacting non-public Whois data: response time for access requests is indeterminate but, in every case, it’s longer than Whois query time prior to 25 May 2018. Investigators cannot provide timely victim notification; simply put, they can’t contact a registrant whose web site has been hacked or is hosting a phishing page. In cases involving malware hosting, or spam campaigns that deliver malware, investigators strive to make near real time blocking or takedown decisions. Optimally, investigators want to blocklist, suspend name resolution, or remove harmful content in 1-4 hours. Meeting this objective was challenging before GDPR. It is more so now, and the consequence to registrants is that security or email administrators have to make allow/deny decisions, with less precision.

In an earlier post, How Far Will Email Operators Take Blocklisting to Prevent Spam?, I explained that “email administrators weigh risk against reward when they make decisions regarding how to mitigate spam. They think first or exclusively about the security of their organization, their users, or their customers.” Domain registrants should interpret this as an indication that false positives or universal acceptance carry less weight than risk mitigation. The best precision that email or security administrators may have with the limited intelligence that Whois offers may be TLD, sponsoring registrar, or name server. Blocking entire registries may become as accepted a practice as dropping traffic to ASNs with poor reputations. If you’re defending your network or protecting your users, blocking the most abused registries and registrars makes even more sense today than prior to GDPR.

While post-GDPR policy suppresses attributable data and makes it very difficult to pivot from one data set to another, the web thankfully does not. Let’s consider another recent article, 90 Days of GDPR: Minimal Impact on Spam and Domain Registrations. The authors assert that there has been a decline in spam volume in the generic Top-Level Domains. John Levine notes, and I concur, that the authors are studying the wrong question. Whether or not spammers would send more spam and register more domains because GDPR came into effect “tells us nothing useful about how GDPR affects anything. It's the wrong question, it's not a question most security people are concerned with, and it ignores how spam and spammers work.”

John explains the fundamentals of spam so well in his article that I’ll encourage you to read it carefully, especially if you are not involved in spam detection or mitigation. I will add, however, that the Talos system and others try to measure spam message volume. Spam message volume fluctuates, sometimes dramatically. Many factors influence when spammers run campaigns: promotional pricing, TLD ownership, changes in spam countermeasures, and botnet takedowns all affect spam distribution. The authors should have considered such factors. For example, Global Registry Services, Ltd. has recently taken over running Famous Four Media’s new gTLD portfolio, which has been previously reported as the spammiest block of new TLDs. The new operators promise to “abandon the failed penny-domain strategy and crack down on spam”. This action will hopefully result in a clean-up of these TLDs. Ninety days is too short a measurement window to evaluate whether this factor or others, such as aggressive blocklisting, GDPR, or the dismantling of a particular spam delivery infrastructure such as Avalanche, are collectively the cause of the decline in spam volume.

The authors dismiss concerns that the spam problem is growing as “popular opinions among security researchers”. This may be a popular opinion, but it’s not accurately attributed. From my experience in this field and my daily interaction with security researchers, it’s more accurate to say that operational security practitioners worry that spam mitigation is more encumbered today, and the ICANN Temporary Spec for Whois, not the GDPR, makes it so. I’ll repeat: focus on the problem space.  

The authors shift to discussing changes in domain name registration volume. Like spam message volume, new registration measurements are not particularly insightful when studying spam. Spammers register thousands of domain names through registrar bulk registration services, at rates of hundreds per minute, thousands per day, periodically, in some cases 6-8 months in advance of using them. New domain registrations are not the relevant denominator in calculating spam abuse percentages. Since only names that resolve in the DNS can be used in spam, new registrations tell us more about pricing, promotions, and flocking behavior than spam trends. Again, I refer you to ICANN’s DAAR and in particular, the methodology paper.
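To make the denominator point concrete, here is a minimal Python sketch. The function name `abuse_rate` and every count in it are hypothetical, invented purely for illustration; none of the figures come from DAAR or any other data set.

```python
def abuse_rate(abused_domains: int, denominator: int) -> float:
    """Return abused domains as a percentage of the chosen denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * abused_domains / denominator


# Hypothetical figures for a single TLD over one measurement window.
abused_domains = 12_000       # domains seen in spam during the window
new_registrations = 40_000    # names registered during the window
resolving_domains = 600_000   # names that resolve in the DNS

rate_vs_new = abuse_rate(abused_domains, new_registrations)
rate_vs_resolving = abuse_rate(abused_domains, resolving_domains)

print(f"vs. new registrations: {rate_vs_new:.1f}%")
print(f"vs. resolving domains: {rate_vs_resolving:.1f}%")
```

The same count of abused domains yields very different percentages depending on the denominator. Because names may be registered months before they are used, the abused domains in a window need not be among that window's new registrations, so the first ratio compares two populations that may barely overlap.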

They note that the Spamhaus Most Abused TLDs list contains many new TLDs, and then discuss how registrations among these have declined. They observe that the new registrations data they studied “contradict the idea that spammers are focusing on registering domains that might be used for spam later. The first and most obvious is the uptick in the percentage of .com domain registrations”. How they arrived at this conclusion eludes me. The authors should have looked at the history of the new TLD program, perhaps using cached Spamhaus abused TLD pages from archive.org. A 90-day observation window tells us little that we did not observe before GDPR and misses a great deal as well. Security researchers know from longer historical data that spammers have migrated from one new TLD, or set of new TLDs, to others since the inception of the program. Early in the program, XYZ was a heavily spammed TLD. Nearly every one of the TLDs in Famous Four Media’s portfolio has been in the top 10 most abused TLDs over the past two years. Flocking to and migrating from new TLDs has largely followed pricing. Registrars that accommodate spammers with low cost and with bulk registration features that are terrifyingly similar to algorithmic domain generation will influence new registrations more than GDPR.

Lastly, the authors seem to fall victim to size bias when they assert that COM is “relatively spam free”. COM is an enormous TLD, with approximately 136M registrations. It’s nearly forty times larger than the largest new TLD (TOP). For comparative purposes, the authors should use the counts of bad domains seen. The count of bad domains seen in COM dwarfs the raw bad domains count found in all of the new TLDs listed in the Spamhaus Most Abused TLD list combined.
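The size-bias point can be illustrated with a small Python sketch. The COM registration count is the approximate figure cited above, and the size used for TOP is simply that figure divided by forty. The bad-domain percentages are hypothetical placeholders chosen for illustration, not measurements, and the `Tld` class is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Tld:
    name: str
    registrations: int
    bad_pct: float  # hypothetical percentage of domains seen as bad

    @property
    def bad_count(self) -> int:
        """Raw number of bad domains implied by the percentage."""
        return round(self.registrations * self.bad_pct / 100)


tlds = [
    Tld("com", 136_000_000, 1.0),  # very large zone, low hypothetical rate
    Tld("top", 3_400_000, 20.0),   # much smaller zone, high hypothetical rate
]

# Rank by raw count of bad domains rather than by percentage.
for tld in sorted(tlds, key=lambda t: t.bad_count, reverse=True):
    print(f"{tld.name}: {tld.bad_pct:.1f}% bad, {tld.bad_count:,} bad domains")
```

Under these assumed rates, the TLD that looks cleaner by percentage still accounts for the larger raw count of bad domains, which is the quantity a network defender actually has to deal with.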

The research in this post does not support the findings.

The findings are not relevant to finding solutions to the problem space.

We need to be better than this. Focus on the problem space.