I remain skeptical of all the Whois studies that I’ve reviewed (FTC, SSAC, ICANN), including studies where I was a party to the research. I’ll apologize for failing to contribute to a satisfactory Whois study. I’ll also admit that my understanding of how to study a problem scientifically has greatly expanded over the past ten years.
A truly scientific Whois study must meet certain common criteria. The purpose should be clearly defined; in particular, the researchers or parties who commission the research should make certain that they are asking the right question.
Before I raise anyone’s brows or blood pressure, I’m not suggesting that these studies lacked rigor.
The researchers were asked to answer a question and they did.
I believe that the wrong question was studied.
The moment you ask, “Does public access to Whois-published data lead to a measurable degree of misuse?” or “Is the WHOIS Service a Source for email Addresses for Spammers?”, you’ve constrained the research scope too narrowly. You’ve also set up a strawman. There’s a difference between asking the wrong question and asking the right question but in the wrong way.
These studies had different findings. They appear to have been conducted scientifically. However, the Whois-specific findings and conclusions do not provide a satisfactory comparative analysis of Whois personal data harvesting versus other methods of collecting personal data; in particular, they do not help us understand:
How much of a threat is personal data harvesting using Whois relative to harvesting using other sources?
Why is this the relevant question? Databases that contain personal data are exposed to multiple threats. To better understand the personal data harvesting issue, we need to consider not one but many of the known personal data harvesting methods that are used by criminals or misused commercially. The study should compare a known set of methods of personal data collection and rank Whois among these. The list of methods that cybercriminals or commercial interests use to collect personal data might include:
- Database breaches, e.g., an attack resulting in the theft or disclosure of tax records or filings, financial accounts, credit card accounts, merchant accounts, bank logins, corporate (HR, ERM, CRM) or government user accounts.
- Crawling and Search, e.g., visiting hundreds of thousands of web sites and extracting email addresses or other personal data from the pages visited, and employing sophisticated or advanced searches across billions of publicly accessible pages that search engines index.
- Traffic or message capture, e.g., capture of personal data from unencrypted or compromised web (http) sessions, email transmissions or file transfers, or the exfiltration of data by malware through covert channels.
- Email user enumeration attacks, e.g., brute force or “dictionary attacks” that harvest active accounts, and which are succeeded by password cracking attacks against enumerated users.
- Malware, the collection of personal data from contacts files by malware that has infected a computer or mobile device, and Ransomware, the extortion of financial account information or payments through malware that has infected a computer or mobile device.
- Whois queries, the collection of point of contact information using the public Whois service.
- Social Engineering Attacks, including Phishing, the luring of victims to impersonated financial or corporate login pages where the victim submits personal or financial account information; Doxing, the harvesting of personal information through social engineering, especially from social media sites; and Business Email Compromise (CEO Fraud), the use of social engineering and email account impersonation to acquire financial data or to hijack or perform fraudulent wire transfers.
- Disclosure by third parties that aggregate and sell or share personal data, without notice or consent.
The ICANN study only asked about a single threat: Whois. This gives us the same kind of insight as a study of cancer in which the question posed is “is X a cause of cancer?” The X study may give us an answer, but it will not be a very comprehensive one. Questions such as “what are the possible causes of cancer, which are most dangerous (aggressive), and which are the causes that we are most likely to be exposed to?” provide more and deeper insights. For example, one study may find that eating red meat causes cancer, and another may find that extended exposure to the sun does, too, but we want to understand which of these and other possible causes of cancer poses the greatest risk so that we can weigh the risks and, most importantly, focus our attention on mitigating the worst among them.
If a study were to be conducted using the threats I listed above, I’m willing to speculate that the findings would rank Whois fairly low, for the reasons explained in the following table:
| Threat to Personal Data | Risk | Popularity | Data Quality and Value | Reason |
| --- | --- | --- | --- | --- |
| Database Breaches | Medium-High | High | High | Databases generally have complete and accurate personal data: contact data as well as account and verification (e.g., CCV) data. Value of |
| Malware, Ransomware | High | High | High | Lax security/encryption practices, vulnerable software, and poor configurations expose data to malware attacks and unintentional leaks (e.g., posts shared publicly rather than with friends only). Difficulty: Low. Malware and ransomware can be downloaded from open repositories, social media pages or the Dark Web. Attack networks can be leased from spam infrastructures (e.g., Avalanche) or botnet herders. |
| Social Engineering Attacks: Doxing, BEC, Phishing, Advance Fee Frauds | High | High | High | Highly sophisticated attackers collect personal data from targeted individuals, businesses or government agencies. Difficulty: High. Doxing and BEC in particular require direct or personal exchanges between attackers and targets, with typically lucrative returns. |
| Disclosure by 3rd parties | Medium-High | Medium-High | Varies | Notice of use or informed consent is poorly practiced, and oversight (until GDPR) is poor. Users make poor or uninformed choices based on cost or desirability of service. Difficulty: Varies. Collection is not difficult, but 3rd parties invest in measures to hide this activity from the public or regulators. |
| Crawling and Search | Medium | High | Varies | Any data that can be indexed can be found using advanced searches and disclosed or misused. Automation (scripts) can collect and extract email addresses or other personal data from millions of web pages with ease. Difficulty: Low. Automation can be downloaded from open repositories, social media pages or web pages. |
| Traffic or Message Capture | Medium | Medium | Varies | Traffic capture today is sometimes state-actor sponsored and typically a patient and long-term extraction of intelligence. The completeness and accuracy of any personal data collected depends on the targeted environment. Difficulty: Medium. These attacks are malware-driven. Some malware is custom; some is available through open repositories or the Dark Web. |
| Email User Enumeration | Medium | Medium | Medium | These attacks are conducted to generate email address lists, to gather intelligence, or to acquire targets. Financial data and other personal data cannot be collected by this means. The lists are much less valued in underground marketplaces than database breach data. There is a market for such synthesized mailing lists, but the parties who purchase them often end up blocklisted because the sellers often collect spamtrap addresses. Difficulty: Low. Automation can be downloaded from open repositories, social media pages or web pages. |
| Whois queries | Medium | Low | Poor | Whois point of contact information is often inaccurate. Whois does not contain financial data. The lists are much less valued in underground marketplaces than database breach data. Difficulty: Low. Automation can be downloaded from open repositories or web pages. Whois rate limiting impedes collection. |
The FTC study attempted a comparative harvesting analysis in 2002. This was the right context. The FTC asked the right question. Let’s repeat the study, with the caveat that the list of prominent attack vectors should be updated to reflect today’s popular harvesting methods, and the procedures, measures, and data analysis should be reviewed to ensure that the new study is scientific.
A scientific study of this breadth would allow us to assess the composite risk of private data disclosure as well as the individual risk through each of these collection methods. It would give us an opportunity to use data to drive informed consensus policy development for an important issue.