Removing Subdomains
Tuesday, November 5, 2024
There was an interesting question on the Unix & Linux StackExchange asking how to remove subdomains or existing domains. I thought it would be fun to show a few different approaches to solving this using Factor.
Our first step should be to understand what a subdomain is:
A subdomain is a prefix added to a domain name to separate a section of your website. Site owners primarily use subdomains to manage extensive sections that require their own content hierarchy, such as online stores, blogs, job boards or support platforms.
Common Subdomains
If we’re curious about what common subdomains are, we can turn to the SecLists project (described as a “security tester’s companion”), which maintains lists of the top 5,000, 20,000, and 110,000 common subdomains that were generated in 2015, as well as a combined subdomains list that has some additional entries.
We can download the list of the top 5,000 common subdomains once into the cache directory, and use memoization to cache the parsed lines:
MEMO: top-5000-subdomains ( -- subdomains )
    "https://raw.githubusercontent.com/danielmiessler/SecLists/refs/heads/master/Discovery/DNS/subdomains-top1million-5000.txt"
    cache-directory download-once-into utf8 file-lines ;
And then see what the “top 10” are:
IN: scratchpad top-5000-subdomains 10 head .
{
    "www"
    "mail"
    "ftp"
    "localhost"
    "webmail"
    "smtp"
    "webdisk"
    "pop"
    "cpanel"
    "whm"
}
We could remove “common subdomains” by repeatedly stripping any matching prefix from the hostname until it stops changing, appending a dot to each subdomain to make sure we only strip a full label:
: remove-common-subdomains ( host -- host' )
    top-5000-subdomains [ "." append ] map '[ _ [ ?head ] any? ] loop ;
And try it out:
IN: scratchpad "www.mail.ftp.localhost.factorcode.org"
               remove-common-subdomains .
"factorcode.org"
That works pretty well, but it’s reliant on a scraped list of subdomains that might not be exhaustive, and could become stale over time as the tools and techniques that developers use change.
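To see why the trailing dot matters, here is a small listener sketch of how `?head` behaves on its own. The hostname "wwwhatever.org" is made up for illustration; the point is only that `?head` compares raw characters and knows nothing about domain labels:

```factor
! A matching prefix is stripped, and t is left on top of the stack.
IN: scratchpad "www.factorcode.org" "www." ?head [ . ] bi@
"factorcode.org"
t

! Without the dot, "www" also matches the start of an unrelated label.
IN: scratchpad "wwwhatever.org" "www" ?head [ . ] bi@
"hatever.org"
t

! With the dot, the host is left untouched and f is returned.
IN: scratchpad "wwwhatever.org" "www." ?head [ . ] bi@
"wwwhatever.org"
f
```

The same textual matching means a host such as "mail.com" would be reduced to "com", since "mail" is in the list, so this approach is best suited to hosts that are known to sit under a longer registered domain.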
Observed Subdomains
Another technique is to rely on our own observations about domains: if we observe a domain being used and subsequently see a subdomain of it, we can ignore the subdomain.
First, we write a word to remove any item that is prefixed by another, sorting to make sure we see the prefix before the item prefixed by it:
: remove-prefixed ( seq -- seq' )
    sort V{ } clone [
        dup '[
            [ _ [ head? ] with none? ] _ push-when
        ] each
    ] keep ;
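As a quick check on plain strings, and assuming the `remove-prefixed` word above is already loaded in the listener, we would expect something like this (the input words are arbitrary):

```factor
! Sorted order is "bar" "barn" "foo" "foobar", so each prefix is
! seen before the longer strings that start with it.
IN: scratchpad { "foobar" "foo" "bar" "barn" } remove-prefixed .
V{ "bar" "foo" }
```

Note that the result comes back in sorted order rather than input order, and as a vector, because items are pushed onto the `V{ }` accumulator as they are accepted.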
Second, we can remove the subdomains by using a kind of Schwartzian transform:
- reverse the domain names
- remove the ones that are prefixed by another
- un-reverse the domain names
: remove-observed-subdomains ( hosts -- hosts' )
    [ "." prepend reverse ] map remove-prefixed [ reverse rest ] map ;
And then see it work:
IN: scratchpad { "a.b.c" "b.c" "c.d.e" "e.f" }
               remove-observed-subdomains .
V{ "b.c" "c.d.e" "e.f" }
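It can help to look at the intermediate form that gets handed to the prefix check. The sketch below applies only the first step, prepending a dot and reversing, to a few made-up hosts:

```factor
! After reversal, a parent domain becomes a prefix of its subdomains.
! The extra dot keeps "b.c" from matching the unrelated "xb.c".
IN: scratchpad { "a.b.c" "b.c" "xb.c" } [ "." prepend reverse ] map .
{ "c.b.a." "c.b." "c.bx." }
```

Here "c.b." is a prefix of "c.b.a." but not of "c.bx.", which is why the dot is added before reversing and dropped again with `rest` afterwards.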
Resolving Domains
And, finally, another technique might be to use the Domain Name System to find the rootiest domain name, that is, the shortest suffix of the hostname that still resolves.
First, we use our dns vocabulary to check that a host resolves to an IP address:
: valid-domain? ( host -- ? )
    {
        [ dns-A-query message>a-names empty? not ]
        [ dns-AAAA-query message>aaaa-names empty? not ]
    } 1|| ;
And try it out:
IN: scratchpad "re.factorcode.org" valid-domain? .
t
IN: scratchpad "not-valid.factorcode.org" valid-domain? .
f
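For readers who have not met `1||` before, it comes from the `combinators.short-circuit` vocabulary: it applies each quotation to the same object and stops at the first true result. A minimal sketch with numbers instead of DNS queries (the predicates are arbitrary):

```factor
IN: scratchpad USE: combinators.short-circuit

! 5 is not greater than 10, but it is odd, so the second test succeeds.
IN: scratchpad 5 { [ 10 > ] [ odd? ] } 1|| .
t

! 4 fails both tests.
IN: scratchpad 4 { [ 10 > ] [ odd? ] } 1|| .
f
```

In `valid-domain?` this means a host counts as valid if it has either an A record or an AAAA record, and the AAAA query is only made when the A query finds nothing.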
Second, we write a word to split a domain into chunks to be tested:
: split-domain ( host -- hosts )
    "." split dup length 1 [-] <iota> [ tail "." join ] with map ;
And try it out:
IN: scratchpad "a.b.c.com" split-domain .
{ "a.b.c.com" "b.c.com" "c.com" }
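The `1 [-]` is what keeps the bare top-level domain out of the candidates. Assuming `split-domain` is loaded as defined above, the edge cases should look like this:

```factor
! A two-label host yields only itself, never the bare "org".
IN: scratchpad "factorcode.org" split-domain .
{ "factorcode.org" }

! A single label yields no candidates at all.
IN: scratchpad "org" split-domain .
{ }
```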
Third, we find the rootiest domain that is valid:
: remove-subdomains ( host -- host' )
    split-domain [ valid-domain? ] find-last nip ;
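The `nip` is there because `find-last` leaves both an index and an element on the stack. A small sketch with numbers standing in for the list of candidate hosts:

```factor
! The last even element is 4, found at index 3.
IN: scratchpad { 1 2 3 4 5 } [ even? ] find-last [ . ] bi@
3
4

! When nothing matches, both results are f.
IN: scratchpad { 1 3 5 } [ even? ] find-last [ . ] bi@
f
f
```

Since `split-domain` orders candidates from longest to shortest, the last valid one is the shortest suffix that resolves. It also follows that `remove-subdomains` leaves `f` on the stack when none of the candidates resolve, so callers may want to check for that.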
And try it out:
IN: scratchpad "a.b.c.d.factorcode.org" remove-subdomains .
"factorcode.org"
IN: scratchpad "sorting.cr.yp.to" remove-subdomains .
"cr.yp.to"
This is available on my GitHub.
It’s fun to explore these kinds of problems!