This post is edited from the transcript of a presentation I gave during an ICANN meeting in 2006. I mentioned it during a conversation with Chris Davis concerning some of the work he is doing at the Secure Domain Foundation. The purpose of the experiment was to determine how much personal identifying information I could collect beginning with Whois and then drawing information from other publicly accessible records. I found it interesting to review the methodology and compare it with how much more PII can be, and is, obtained today from public records, and decided the study was worth a second look.
My objectives for this experiment were to determine the extent to which I could collect personal contact (not identifying) information, beginning with information available from domain name registration records that I had acquired from a Whois records source and complementing this with data from public resources. In so doing, I hoped to establish with confidence whether or not the registrant was a natural person or a person operating a business from a residence.
Data and Sample
For this experiment, I obtained 20 million registration records from a data service (six CDs containing zip files that expanded into ASCII tab-delimited text files). From these, I extracted a sample of 5,000 registrations, choosing only registration records where the registrant or administrative point of contact address in the Whois record contained the city of Philadelphia, Pennsylvania.
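The sampling step above can be sketched in a few lines of Python. This is a minimal sketch, not the original 2006 scripts: the column names (`registrant_city`, `admin_city`, and so on) are hypothetical, since the layout of the data service's files is not described here, and taking the first matching records up to the sample size is also an assumption.

```python
import csv
import io
from itertools import islice

# Hypothetical column names; the real file layout is not described in the post.
CONTACT_FIELDS = (
    ("registrant_city", "registrant_state"),
    ("admin_city", "admin_state"),
)

def is_philadelphia(record):
    """True if the registrant or administrative contact is in Philadelphia, PA."""
    for city_field, state_field in CONTACT_FIELDS:
        city = (record.get(city_field) or "").strip().lower()
        state = (record.get(state_field) or "").strip().lower()
        if city == "philadelphia" and state in ("pa", "pennsylvania"):
            return True
    return False

def draw_sample(lines, size):
    """Stream tab-delimited records and keep the first `size` Philadelphia matches."""
    reader = csv.DictReader(lines, delimiter="\t")
    return list(islice((row for row in reader if is_philadelphia(row)), size))

# Made-up records for illustration only.
demo = io.StringIO(
    "domain\tregistrant_city\tregistrant_state\tadmin_city\tadmin_state\n"
    "example-one.test\tPhiladelphia\tPA\tPhiladelphia\tPA\n"
    "example-two.test\tBoston\tMA\tPhiladelphia\tPA\n"
    "example-three.test\tChicago\tIL\tChicago\tIL\n"
)
sample = draw_sample(demo, size=5000)
print([row["domain"] for row in sample])
```

Streaming the file row by row keeps memory use flat, which matters when the input is on the order of 20 million records.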
The approach I used for this experiment mimics the way a forensic investigator compares a fingerprint taken at a crime scene against fingerprints in a database. The investigator tries to find a sufficient number of matching points to feel confident that the fingerprints match. In my experiment, I identified eleven data matches that would serve as indicators of personal or business contact information. I set a threshold of 7 matches as the minimum number to conclude that a particular registration contact was a personal or a business contact.
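The fingerprint-style threshold described above can be illustrated as a simple count of matching points. This is a hedged sketch: the post gives only the totals (eleven match points, a threshold of seven), so the point names below are placeholders, and the helper functions are hypothetical rather than anything used in the original study.

```python
MATCH_POINTS = 11     # number of data matches named in the post
MATCH_THRESHOLD = 7   # minimum matches needed to reach a conclusion

def count_matches(observations):
    """Count the match points that were actually observed (True values)."""
    if len(observations) > MATCH_POINTS:
        raise ValueError("more observations than match points")
    return sum(1 for matched in observations.values() if matched)

def conclude(observations, label, threshold=MATCH_THRESHOLD):
    """Return `label` when enough points match, otherwise 'undetermined'."""
    return label if count_matches(observations) >= threshold else "undetermined"

# Placeholder point names: in `strong` the first eight points matched,
# in `weak` only the first four did.
strong = {f"point_{i}": i <= 8 for i in range(1, 12)}
weak = {f"point_{i}": i <= 4 for i in range(1, 12)}
print(conclude(strong, "personal contact"))
print(conclude(weak, "personal contact"))
```

A record that falls short of the threshold is left undetermined rather than forced into a category.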
I used publicly accessible resources to collect bits of data about these registrants:
- A real estate agent research database for the Philadelphia area called trulia.com,
- The Internet telephony phone directory, whitepages.com,
- Several search engines including Yahoo! and Google,
- Aerial photographs and Google Earth,
- MapQuest maps,
- A companies and industries directory from Hoovers,
- Web sites of registered domains in the sample, and
- Personal familiarity with the geographic region
I chose Philadelphia, Pennsylvania because I had lived there for 30 years. I was very familiar with most of the neighborhoods that appeared in registrant or administrative contact addresses. By combining my familiarity with the general area with the maps and photographs, I was able to deduce with some confidence whether an address was in a residential as opposed to a business neighborhood. When in doubt, I asked my wife, who studied, interned or worked in hospitals downtown and in the Philadelphia metropolitan area. I freely admit she was often the more accurate resource.
Note: This 2006 study was "old school". I used very little automation beyond the initial scripts I wrote to obtain a sample. Much of the experiment involved eyeballing web pages, photos, or database query responses. Imagine a fingerprint analyst inspecting physical, not digital, fingerprint samples against physical fingerprints on record. Today, tools like Maltego can be used to efficiently derive similar results across more and substantially larger data repositories.
Data I used to determine whether a registrant was a natural person included:
- Whether the Registrant Name appeared to be a personal name, which served as an initial matching point,
- Whether the registrant's phone number could be identified as a residential listing using, for example, a reverse phone number search via whitepages.com. (Whitepages.com allows you to search for neighbors, i.e., who lives close to a telephone number, and I used this to increase my confidence, i.e., “if neighbors are natural persons…”.) This site also provided a means to distinguish personal mobile phone accounts from business accounts.
- Whether an apartment number was present in the Registrant Address, which suggests that the address is a multi-tenant building. By checking realty or building zone information, it’s possible to further determine whether the building was registered as a residential apartment or a multi-tenant (office) building.
- Whether residential real estate listings appeared near the registrant's address (same street, same block or neighborhood), which provided another matching point. This methodology works well in the United States. Many neighborhoods are zoned residential or are readily distinguished from photographs or via Google Earth; for example, a classic American two-story colonial in a neighborhood that I recognized, had visited, or had friends in is very probably not a business with many employees in Philadelphia (at least, not legally).
For those neighborhoods that were not easily distinguished as residential from maps or photographs, or where a residence was an apartment above a pizza parlor or next to an Asian restaurant, or a small home behind an automobile repair shop, this marker becomes a little muddier. When I saw muddiness of this kind, I didn't count the record as containing a personal contact.
- Whether the registrant's web site revealed personal information. It is quite often the case that an individual who includes personal information in a registrant record also hosts a web site that reveals even more information that marks the registrant as a natural person.
Data I used to determine whether a registrant was a business entity included:
- Having a listing as a public corporation operating under a fictitious name in the EDGAR database (via Hoovers),
- Having the registrant phone number in a business listing in the Yellow Pages or a phone number search,
- Having a toll-free (8xx) telephone number,
- Information at the registrant’s web site, and
- Finding a suite number in the registrant contact address.
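A few of the indicators in these lists (a toll-free number, a suite number, an apartment number) are mechanical enough to check in code. The sketch below is illustrative only: the 2006 study checked these by eye, the function names and regular expressions are hypothetical, and the toll-free list reflects the North American toll-free area codes in use today, several of which did not exist in 2006.

```python
import re

# North American toll-free area codes (the "8xx" numbers).
TOLL_FREE_AREA_CODES = {"800", "833", "844", "855", "866", "877", "888"}

# Rough patterns; real addresses are messier than these assume.
SUITE_RE = re.compile(r"\b(?:suite|ste)\b\.?\s*#?\s*\w+", re.IGNORECASE)
APARTMENT_RE = re.compile(r"\b(?:apartment|apt)\b\.?\s*#?\s*\w+", re.IGNORECASE)

def area_code(phone):
    """Return the 3-digit area code of a US-style number, or None if unparseable."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits[:3] if len(digits) == 10 else None

def is_toll_free(phone):
    """True when the number's area code is a toll-free code (business hint)."""
    return area_code(phone) in TOLL_FREE_AREA_CODES

def address_hints(address):
    """Flag a suite number (business hint) or apartment number (residential hint)."""
    return {
        "suite": bool(SUITE_RE.search(address)),
        "apartment": bool(APARTMENT_RE.search(address)),
    }

# Made-up contact details for illustration only.
print(is_toll_free("+1.8005550100"))
print(is_toll_free("(215) 555-0100"))
print(address_hints("100 Example St, Suite 400"))
print(address_hints("12 Sample Ave Apt. 3B"))
```

Each check yields only one matching point; none of them is conclusive on its own.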
I again speculated that when the registrant's neighbors are businesses, as would be the case for a shopping center or mall, the registrant is likely to be a business. I was often able to use my familiarity with the city of Philadelphia to quickly conclude when the address was a multi-story, multi-tenant, professional or corporate office building. I could also deduce from aerial photos or Google Earth that certain addresses were professional office campuses in the suburbs. A registrant’s web site, especially its About Us pages, often provides considerable confirming data: photo galleries (offices, products, data or business centers, happy office workers).
The first set of findings only considers the registrant contact information.
Approximately nine percent (9%) of registrations were personal contacts that are very likely natural persons. Home-operated business contacts account for three percent (3%) of the sampled registrations. I distinguished home-operated business registrations from personal contacts in cases where the Registrant Name was a business operating from a residential address. If you accept the assertion that home-operated business contacts are also natural persons, then approximately one in eight registrations in the sample reveals contact information for natural persons.
Fifty-six percent (56%) were business contacts and six percent (6%) were domain name industry businesses (e.g., a broker for those seeking a popular name and willing to bid for the registration). Thirteen percent (13%) were records that represented a proxy or some effort by the individual to allow someone else to host and hold the name as the registrant.
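The arithmetic behind these percentages is easy to verify. In the short sketch below, the shares are the ones reported above; the category labels are mine, and treating the leftover share as records that could not be classified is an assumption, since these two paragraphs do not break it out.

```python
# Reported shares of the 5,000-record sample, in percent.
shares = {
    "personal": 9,
    "home_operated_business": 3,
    "business": 56,
    "domain_industry": 6,
    "proxy": 13,
}

natural_persons = shares["personal"] + shares["home_operated_business"]
accounted_for = sum(shares.values())
remainder = 100 - accounted_for  # assumed to be records left unclassified

print(f"natural persons: {natural_persons}% (about 1 in {100 / natural_persons:.1f})")
print(f"accounted for: {accounted_for}%, remainder: {remainder}%")
```

Twelve percent works out to roughly one registration in eight, which matches the figure quoted for natural persons.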
By checking whether the administrative contact information was different from the registrant contact information, I found administrative contact information sufficient to identify 13 additional natural persons. I did not use technical contact information, as the technical contact information in the majority of records in the sample proved to be contact names and addresses of hosting providers.
Observations and Conclusions
Not surprisingly, the majority of registration records in my sample contained business points of contact. However, using speculative, subjective criteria, I was able to gather sufficient information from domain registration records to identify approximately 13% of registration points of contact in my “Philadelphia” sample as natural persons. Approximately one in seven registration records contained insufficient information to conclusively distinguish whether contacts were businesses or individuals.
It is possible to use these findings to justify public display of domain name registration records. It is also appropriate to suggest that sufficient information relating to natural persons is available from Whois that some form of protection for these registrants is warranted. It’s possible for Whois services to support both.