This post is edited from the transcript of a presentation
I gave during an ICANN meeting in 2006. I mentioned it during a
conversation with Chris Davis concerning some of the work he is doing at the
Secure Domain Foundation. The purpose of the experiment was to determine how
much personal identifying information I could collect beginning with Whois and
then drawing information from other publicly accessible records. I found it interesting to review the methodology
and compare it with how much more PII can be, and is, obtained today from public records, and I decided the study was worth a second look.
My objectives for this experiment were to determine the extent to
which I could collect personal contact (not identifying) information, beginning
with information available from domain name registration records that I had
acquired from a Whois records source and complementing this with data from public resources. In so doing, I hoped to establish with confidence whether the registrant was a natural person or a person operating
a business from a residence.
Data and Sample
For this experiment, I obtained 20 million registration records from
a data service (six CDs containing zip files that expanded into ASCII
tab-delimited text files). From these, I extracted a 5,000-registration sample, choosing only
registration records where the registrant or administrative point of contact
address in the Whois record contained the city of Philadelphia, Pennsylvania.
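As a rough sketch of how this sampling step could be scripted today (this is not the original 2006 script), the filter below reads tab-delimited text and keeps records whose registrant or administrative contact is in Philadelphia, Pennsylvania. The column names (`registrant_city`, `admin_state`, and so on), the header row, and the rule of keeping the first 5,000 matches are all assumptions; the layout of the files and the way the sample was drawn from the matching records are not described here.

```python
import csv
import io
from itertools import islice

# Hypothetical column prefixes; the real header layout of the files is not given.
CONTACT_PREFIXES = ("registrant", "admin")

def is_philadelphia(row):
    """True if the registrant or administrative contact is in Philadelphia, PA."""
    for prefix in CONTACT_PREFIXES:
        city = (row.get(f"{prefix}_city") or "").strip().lower()
        state = (row.get(f"{prefix}_state") or "").strip().upper()
        if city == "philadelphia" and state in ("PA", "PENNSYLVANIA"):
            return True
    return False

def extract_sample(lines, limit=5000):
    """Return up to `limit` matching rows from tab-delimited text with a header row."""
    reader = csv.DictReader(lines, delimiter="\t")
    # Assumption: keep the first `limit` matches in file order.
    return list(islice(filter(is_philadelphia, reader), limit))

# Made-up demo rows using reserved .test names.
demo = io.StringIO(
    "domain\tregistrant_city\tregistrant_state\tadmin_city\tadmin_state\n"
    "example-one.test\tPhiladelphia\tPA\tPhiladelphia\tPA\n"
    "example-two.test\tBoston\tMA\tPhiladelphia\tPA\n"
    "example-three.test\tChicago\tIL\tChicago\tIL\n"
)
sample = extract_sample(demo)
print([row["domain"] for row in sample])
```

Checking the state as well as the city matters because other US states also have a Philadelphia.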
Methodology
The approach I used for this experiment mimics the way a forensic
investigator compares a fingerprint taken at a crime scene against fingerprints
in a database. The investigator tries to find a sufficient number of matching
points to feel confident that the fingerprints match. In my experiment, I
identified eleven data matches that would serve as indicators of personal or
business contact information. I set a threshold of 7 matches as the minimum
number to conclude that a particular registration contact was a personal or a
business contact.
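The threshold rule can be sketched in a few lines. The eleven indicators and the minimum of 7 matches come from the description above; how the matches were tallied is not spelled out, so this sketch assumes each indicator casts one vote for "personal" or "business", or no vote when it is inconclusive. The function name and labels are hypothetical.

```python
from collections import Counter

TOTAL_INDICATORS = 11  # number of data matches named in the text
THRESHOLD = 7          # minimum matches needed to reach a conclusion

def classify(votes, threshold=THRESHOLD):
    """Classify one contact from its indicator votes.

    `votes` holds one entry per indicator: "personal", "business",
    or None when the indicator was inconclusive (an assumed encoding).
    """
    if len(votes) != TOTAL_INDICATORS:
        raise ValueError(f"expected {TOTAL_INDICATORS} votes, got {len(votes)}")
    counts = Counter(v for v in votes if v is not None)
    if counts["personal"] >= threshold:
        return "personal"
    if counts["business"] >= threshold:
        return "business"
    # Fewer than `threshold` matches for either label: no conclusion.
    return "undetermined"

print(classify(["personal"] * 7 + [None] * 4))
print(classify(["personal"] * 6 + ["business"] * 5))
```

With eleven votes and a threshold of 7, both labels can never qualify at once, so the order of the two checks does not affect the result.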
I used publicly accessible resources to collect bits of data about
these registrants:
- A real estate agent research database for the Philadelphia area
called trulia.com,
- The Internet telephony phone directory, whitepages.com,
- Several search engines including Yahoo! and Google,
- Aerial photographs and Google Earth,
- MapQuest maps,
- A companies and industries directory from Hoovers,
- Web sites of registered domains in the sample, and
- Personal familiarity with the geographic region.
I chose Philadelphia, Pennsylvania because I had lived there for 30
years. I was very familiar with most of the neighborhoods that appeared in
registrant or administrative contact addresses. By combining my familiarity
with the general area with the maps and photographs, I was able to deduce with
some confidence whether an address was in a residential as opposed to a business
neighborhood. When in doubt, I asked my wife, who studied, interned or worked
in hospitals downtown and in the Philadelphia metropolitan area. I freely admit
she was often the more accurate resource.
Note: This 2006 study was "old school". I used very little automation beyond the initial scripts I wrote to obtain a sample. Much of the experiment involved eyeballing web pages, photos, or database
query responses. Imagine a fingerprint analyst comparing physical, not digital, fingerprint samples against physical fingerprints on record. Today, tools like Maltego can be used to efficiently derive similar results across more and substantially larger data repositories.
Personal Contact
Data I used to determine whether a registrant was a natural person
included:
- Whether the Registrant Name appeared to be a personal name, which served as an
initial matching point,
- Whether the Registrant's Phone number could be identified as a
residential listing using, for example, a reverse phone number search via whitepages.com
(Whitepages.com allows you to search for neighbors, i.e., who lives close to a
telephone number, and I used this to increase my confidence, i.e., “if neighbors
are natural persons…”). This site also provided a means to distinguish personal
mobile phone accounts from business accounts.
- Whether an apartment number was present in the Registrant Address, which suggests
that the address is a multi-tenant building. By checking realty or building
zone information, it’s possible to further determine whether the building was
registered as a residential apartment or a multi-tenant (office) building.
- Residential real estate listings near the registrant's address (same
street, same block or neighborhood) provided another matching point. This
methodology works well in the United States. Many neighborhoods are zoned
residential or are readily distinguished from photographs or via Google Earth;
for example, a classic American two-story colonial in a neighborhood that I
recognized, had visited, or where I had friends is very probably not a business
with many employees in Philadelphia, at least not legally.
For those neighborhoods that were not easily distinguished as
residential from maps or photographs, or where a residence was an apartment
above a pizza parlor or next to an Asian restaurant, or for a small home behind
an automobile repair shop, using this marker becomes a little bit muddier. When
I saw muddiness of this kind, I didn't count this record as containing a
personal contact.
A registrant's Web site sometimes reveals personal
information. It is quite often the case that an individual who includes his
personal information in a registrant record also hosts a web site that reveals
even more information that marks the registrant as a natural person.
Business Contact
Data I used to determine whether a registrant was a business entity
included:
- Having a listing as a public corporation operating under a fictitious
name in the EDGAR database (via Hoovers),
- Having the registrant phone number in a business listing in the
Yellow Pages or a phone number search,
- Having a toll-free (8xx) telephone number,
- Information at the registrant’s web site, and
- Finding a suite number in the registrant contact address.
I again speculated that when the
registrant's neighbors are businesses – as would be the case for a shopping
center or mall – then the registrant is likely to be a business. I was often
able to use my familiarity with the city of Philadelphia to quickly conclude
when the address was a multi-story, multi-tenant, professional or corporate
office building. I could also deduce from aerial photos or Google Earth that certain
addresses were professional office campuses in the suburbs. A registrant’s web
site, especially About Us pages, often
provides considerable confirmation data: photo galleries (offices, products,
data or business centers, happy office workers).
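Two of the indicators above, the toll-free (8xx) number and the suite or apartment number, are simple enough to check mechanically. The sketch below is an illustration under assumptions, not the procedure used in the study, where these checks were done by eye. The function names are hypothetical, the patterns are deliberately crude (a street called "Apartment Way" would fool them), and the area-code list reflects the North American toll-free codes in service today, several of which did not exist in 2006.

```python
import re

# North American toll-free area codes currently in service.
TOLL_FREE_AREA_CODES = {"800", "888", "877", "866", "855", "844", "833"}

def is_toll_free(phone):
    """True if a North American phone number uses a toll-free area code."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return len(digits) == 10 and digits[:3] in TOLL_FREE_AREA_CODES

# Crude patterns: a unit keyword followed by a unit identifier.
UNIT_PATTERNS = {
    "personal": re.compile(r"\b(?:apt|apartment)\b\.?\s*#?\s*\w+", re.IGNORECASE),
    "business": re.compile(r"\b(?:ste|suite)\b\.?\s*#?\s*\w+", re.IGNORECASE),
}

def address_unit_hint(address):
    """Return "personal" for an apartment number, "business" for a suite, else None."""
    for label, pattern in UNIT_PATTERNS.items():
        if pattern.search(address):
            return label
    return None

print(is_toll_free("1-800-555-0100"))
print(address_unit_hint("1500 Market St, Suite 200"))
print(address_unit_hint("123 Pine St Apt 4B"))
```

Either result is only a hint: as noted earlier, an apartment number may still belong to a multi-tenant office building, so it counts as one matching point rather than a conclusion.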
Findings
The first set of findings only considers the registrant contact information.
Approximately nine percent (9%) of registrations were personal
contacts that are very likely natural persons. Home-operated business contacts
account for three percent (3%) of the sampled registrations. I distinguished home-operated
business registrations from personal contacts in cases where the Registrant
Name was a business operating from a residential address. If you accept the
assertion that home-operated business contacts are also natural persons,
then approximately one in eight registrations in the sample reveals contact
information for natural persons.
Fifty-six percent (56%) were business contacts and six percent (6%)
were domain name industry businesses (e.g., a broker for those seeking a popular
name and willing to bid for the registration). Thirteen percent (13%) were
records that represented a proxy or some effort by the individual to allow
someone else to host and hold the name as the registrant.
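The arithmetic behind these figures is easy to check. The percentages below are the ones reported above; the dictionary keys are just labels chosen for this sketch.

```python
# Reported shares of the sample, in percent.
shares = {
    "personal": 9,
    "home_operated_business": 3,
    "business": 56,
    "domain_name_industry": 6,
    "proxy": 13,
}

# Personal plus home-operated business contacts: 9 + 3 = 12 percent.
natural_person_share = shares["personal"] + shares["home_operated_business"]

# 100 / 12 is about 8.3, hence "approximately one in eight".
one_in_n = 100 / natural_person_share

# Share of the sample not covered by any of the categories listed.
unassigned_share = 100 - sum(shares.values())

print(natural_person_share, round(one_in_n, 1), unassigned_share)
```

The listed categories sum to 87 percent, which leaves 13 percent of the sample outside them.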
By checking whether the administrative contact information was
different from the registrant contact information, I found administrative
contact information sufficient to identify 13 additional natural persons. I did
not use technical contact information, as the technical contact information in the
majority of records in the sample proved to be contact names and addresses of hosting providers.
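The comparison of administrative and registrant contacts could be sketched as follows. The field names are hypothetical, since the record layout is not given, and the normalization (lowercasing and collapsing whitespace) is an assumption about what should count as "the same" contact.

```python
# Hypothetical contact fields; each appears as registrant_<field> and admin_<field>.
CONTACT_FIELDS = ("name", "street", "city", "phone")

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join((value or "").lower().split())

def admin_differs(record):
    """True if any administrative contact field differs from the registrant's."""
    return any(
        normalize(record.get(f"registrant_{field}"))
        != normalize(record.get(f"admin_{field}"))
        for field in CONTACT_FIELDS
    )

# Made-up record: same organization address, different named contact.
record = {
    "registrant_name": "Example Widgets Inc",
    "admin_name": "Pat Example",
    "registrant_street": "1500 Market St",
    "admin_street": "1500  market st",
    "registrant_city": "Philadelphia",
    "admin_city": "Philadelphia",
    "registrant_phone": "215-555-0100",
    "admin_phone": "215-555-0100",
}
print(admin_differs(record))
```

Records flagged this way would still need the same indicator checks as any registrant contact before being counted as a natural person.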
Observations and Conclusions
Not surprisingly, the majority of registration records in my
sample contained business point-of-contact information. However,
using speculative-subjective criteria, I was able to gather sufficient
information from domain registration records to identify approximately 13% of
registration points of contact in my “Philadelphia” sample as natural persons. Approximately
one in seven registration records contained insufficient information to
conclusively distinguish whether contacts were businesses or individuals.
It is possible to use these findings to justify public display of
domain name registration records. It is also appropriate to suggest that
sufficient information relating to natural persons is available from Whois that
some form of protection for these registrants is warranted. It’s possible for
Whois services to support both.