Detecting Phishing using Visual Similarity

I presented new work on crawling URLs at scale and identifying similar activity using screenshot similarity at DeepSec in November, 2024. I am continuing to work on this topic, using LLMs and crawling optimization to better build it out. I’ll be presenting again at Qubit Prague, May 22, 2025.

My hope is to make this a publicly accessible service before speaking in Prague, but it uses a lot of resources. I run it all out of my house (and on some VPSs in various places). I’m not yet sure how to make it publicly available without a server melting once enough people start using it…

I’m currently building a process and interface to crawl known-bad phishing pages, where I take a screenshot and collect other data. That data is going to be used to find similar-looking screenshots and similar-behaving network traffic from streaming logs of visited URLs. This is initially for a talk I’m doing at DeepSec in Vienna this fall, but I hope there’s enough time to make a live website that people can try out instead of only uploading my code to GitHub.

So far, I’ve made a couple things:

1: A web interface where you can upload a list of URLs and get the approximate physical location of the hosting of those domains/subdomains. This is to support my need to figure out where to put web crawlers.

2: Web crawlers. I run 20 at a time in docker-compose. They accept URLs and domains and will try to get a screenshot and other data. More of them will be built in the locations where I see most phishing coming from.

3: An interface to group similar screenshots together in order to easily delete those that don’t contribute to a malicious dataset and to easily keep those that can be added to a malicious dataset.
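One common way to group similar-looking screenshots is to compare perceptual hashes. The following is only a minimal sketch of that idea, not the code behind the interface described above: it assumes each screenshot has already been reduced to a small grayscale grid (for example 8x8) by an image library, and the function names and the distance threshold are hypothetical.

```python
from typing import Dict, List, Sequence


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Build an average hash from a small grayscale grid (values 0-255).

    Each pixel contributes one bit: 1 if brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def group_similar(hashes: Dict[str, int], max_distance: int = 5) -> List[List[str]]:
    """Greedy grouping: a screenshot joins the first group whose first
    member is within max_distance bits, otherwise it starts a new group."""
    groups: List[List[str]] = []
    for name, value in hashes.items():
        for group in groups:
            if hamming(hashes[group[0]], value) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Greedy grouping is order-dependent and deliberately simple; a real dataset would likely need a proper clustering step and a threshold tuned against known-bad pages.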

And I’m working on:

  1. A sort of ‘analyst interface’ that shows information from the malicious dataset and then any hits for similar websites, all on a timeline. This is a lot of JavaScript and I do not like JavaScript…

I’ve been thinking about this for a while. I have a fantastic job working with Cisco Talos and hope to keep doing it for a while. My team is great and the work is fulfilling, challenging, fun, and satisfies my passions to help make cybersecurity better for a lot of people and organizations.

I also want to enable small businesses, non-profits, and even individuals to better secure themselves against malicious activity. Almost all my research is me trying to build tools that can be used by smaller organizations, but none of it is really a ‘product’ yet. I am working towards building various products and services that can be provided on a sliding scale price. Hopefully I can find customers for some of these, and then can use any money that’s made to fund scaling up the security of places that either can’t afford or don’t even think about security.

Large organizations have the resources to hire threat researchers and security engineers, build SOCs, and more, but the smaller ones are left to purchase (often) sub-par services that don’t deliver the same quality. We hear about breaches, ransomware, or some other attack on large organizations where the initial attack vector was a smaller contracted organization. So we have resource-heavy organizations that have the security defenses they need to protect themselves, but then we have small businesses with few if any resources handling some part of their business. It’s a major gap in security and needs to be dealt with.

I believe large security organizations should be providing the same amazing services they charge a lot for to these smaller organizations for sliding-scale fees, and in some cases, completely free of charge.

So my goal/dream is to build an organization where enterprise-level services are provided at enterprise prices to paying customers, but the money flows down to those that can’t afford it. Large organizations would benefit from a wider security net across industries and smaller organizations will be more secure.

I don’t yet know when it will happen, but I hope to eventually be able to turn Pyosec into a security company while funneling all low-cost and free services through a non-profit. This would likely require me not having a regular day job, but for now I have a lot more work to do with my team and don’t intend to leave any time soon. In the meantime, I will continue thinking and strategizing about how this will eventually come to fruition while doing what I can in my free time to continue researching, giving presentations, and educating/working with non-profits and small businesses.

If you’ve found this site and are interested in having a chat about your organization or security posture, please reach out via the contact page. This isn’t me trying to sell something/build sales leads. I have skills and knowledge and want to help.

I had the wonderful opportunity to once again present my current work at DeepSec in Vienna in November 2023. I presented new work on URL Analysis at Scale.

The research resulted in a web app and API that use spelling, natural language processing, machine learning, and some other techniques to quickly analyze large lists of URLs or streaming URLs and find the ones that are likely malicious. It scales by using RabbitMQ and AWS Lambda functions to increase processing power as needed.

It took me about 3 months to create the project, and then I took a few months off after presenting. However, I’m back to working on it. My plan is to create a publicly accessible web app/API where others can use the detection I’ve built in.

The presentation slides can be found at https://pyosec.com/research/.

I used data from the (now defunct) malware wiki and cyber.nj.gov to create this timeline, which I keep up to date when possible. The timeline is generated using timeline.knightlab.com. The data I managed to collect from the malware wiki before it disappeared can be downloaded as a CSV here

I presented on automating threat intelligence yesterday at QuBit in Sofia, Bulgaria.

This was my first time giving this presentation, and as usual (for me), I was coding up to the moment I walked on stage. I thought it went really well and learned a lot from the audience. What I learned will feed back into additional research on my attempt to automate myself out of a job!

If interested in my code and slides, they can be found in the Research and Presentations section.

Phishing is an efficient method for an attacker to deliver malware or harvest credentials from unsuspecting victims. By sending out a mass or targeted email designed to look like it came from a bank or other legitimate source, an attacker can acquire a fair number of user credentials or deliver malware. Credentials can be used for identity theft, for additional compromise, or to send more seemingly legitimate phishing emails. Convincing a user to install malware can give attackers access to a system.


Phishing will typically use domains from one of three sources:

  • Free hosting providers, often used for the most basic phishes,
  • Paid hosting, typically used for targeted attacks. In an attempt to appear more legitimate, an attacker may use a domain that is similar in name to the domain they’re impersonating,
  • Compromised hosts or registrars. In these cases, either a website is compromised and phishing content is hosted deep within the site, or the registrar is compromised and subdomains are configured to point to phishing content on the same or different servers.

To get an idea of what kinds of domains phishing attacks are using at present, we’ve analyzed a portion of data from phishtank.com.

Phishtank is a website run by OpenDNS where members submit potential phishes for review by other members of the community. When enough votes confirm a phishing attack, it is labeled as a verified phish. Phishtank is a relatively small slice of phishing content on the internet. We are only looking at a data set of just over 3 million reported phishing attempts. However, looking at the verified phishing attacks for just this month, we are able to see some basic patterns.


To get this data, we downloaded a copy of the verified phishing attempts that were online as of this month from the statistics page at phishtank.com and analyzed the data using Python. With the URL path (the part after domain.com/) removed, we were left with domains and subdomains. We then ran those through the OpenDNS Investigate API to collect ASN organizational information for each unique domain. That provided a summary of the organizations responsible for the domains hosting phishing content.

As of this writing, 3,256,785 phishes have been submitted to phishtank and 1,837,862 of those have been verified as valid.

31,219 are currently listed as online. In our analysis, we used only the second-level domain names from all the currently online phishes and removed duplicates, leaving 9,902 unique domain names.
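The reduction from full phishing URLs to unique domain names can be sketched in a few lines of Python. This is an illustration rather than the script used for the analysis: the function names are hypothetical, and keeping only the last two hostname labels is a simplifying assumption that mishandles suffixes such as co.uk (a public suffix list would be needed for those).

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def registered_domain(url: str) -> str:
    """Drop the scheme, path, and subdomains, keeping the last two labels.

    Naive assumption: the registrable domain is always the final two
    labels of the hostname, which is wrong for suffixes like co.uk.
    """
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds a hostname after a scheme
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def unique_domains(urls: Iterable[str]) -> List[str]:
    """Return each registered domain once, in first-seen order."""
    seen = set()
    ordered = []
    for url in urls:
        domain = registered_domain(url)
        if domain and domain not in seen:
            seen.add(domain)
            ordered.append(domain)
    return ordered
```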

1,072 of these domains had no organizational attribution as they no longer resolved to an IP address, leaving us with 8,830 domains still attributed to an ASN.
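Once each domain has been mapped to an organization, separating the unattributed domains and ranking the rest is a small counting exercise. The sketch below assumes the lookups have already been collected into a dictionary of domain to organization name, with None for domains that no longer resolve; that structure and the function name are assumptions made for illustration, and no API call is shown.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def top_organizations(
    attribution: Dict[str, Optional[str]], n: int = 10
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Return the n organizations with the most domains, plus the
    domains that have no organizational attribution."""
    unattributed = sorted(d for d, org in attribution.items() if org is None)
    counts = Counter(org for org in attribution.values() if org is not None)
    return counts.most_common(n), unattributed
```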

The following is a graphical view of the top 10 organizations with the most phishing content:


Let’s take a look at the worst offender in this analysis, CyrusOne.

CyrusOne provides colocation services, so they may not be directly responsible for maintaining the compromised or purchased hosts that are used in phishing attacks. They may be the leader in phishes from this data set at the moment simply due to their size, with two dozen data centers across the United States, Europe, and Asia.

Looking at specific domains from this set, we can see how phishing attacks operate when targeted or when using compromised or free hosting.

Targeted Hosting

serviceyourpaypal[.]com

This domain appears to have been purchased specifically for use in targeted phishing attacks with the goal of acquiring PayPal credentials and stealing money from PayPal customers.

serviceyourpaypal[.]com was registered on September 14, 2014 at launchpad[.]com. It’s using domain privacy services provided by privacyprotect[.]org to hide administrative and technical details for the person or organization who bought the domain name.

It is hosted at Hostgator, a well-known and inexpensive hosting provider, on a shared host at the IP address 192.185.4[.]25. This IP address is hosting a total of 369 domain names.


Looking at requests for this domain through OpenDNS infrastructure, we can see a small but consistent number of DNS requests. The domain is not serving any useful content at present, as can be seen in the following image:


However, serviceyourpaypal[.]com could be re-activated at any time and used in future PayPal-themed phishing campaigns. It remains suspicious because its name is similar to paypal[.]com and because it uses an ASN other than those used by legitimate PayPal domains.
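The name-similarity half of that check is easy to approximate. The following sketch flags a domain when its second-level label contains the brand name or is a near match to it; it is an illustration only, the function names and the 0.8 threshold are assumptions, and it does not cover the ASN comparison, which would need a separate lookup.

```python
from difflib import SequenceMatcher


def refang(domain: str) -> str:
    """Turn a defanged domain such as example[.]com back into example.com."""
    return domain.replace("[.]", ".").lower()


def looks_like(domain: str, official: str, threshold: float = 0.8) -> bool:
    """Return True if domain resembles the official domain without being it."""
    domain = refang(domain)
    official = refang(official)
    if domain == official or domain.endswith("." + official):
        return False  # the real domain or one of its subdomains
    brand = official.split(".")[0]
    labels = domain.split(".")
    name = labels[-2] if len(labels) >= 2 else labels[0]
    if brand in name:
        return True  # brand embedded in a longer name
    # Otherwise fall back to a fuzzy comparison to catch small typos.
    return SequenceMatcher(None, name, brand).ratio() >= threshold
```

For example, looks_like("serviceyourpaypal[.]com", "paypal.com") is flagged because the brand is embedded in the name, while a subdomain of the official domain is not.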

applesverifications[.]com

applesverifications[.]com was registered on September 2, 2015 at launchpad[.]com and does not hide its whois information behind a privacy service. That doesn’t necessarily mean the information is factual. In some cases, adding whois privacy costs extra when registering a domain. The domain is hosted with Hostgator and its IP address hosts a total of 907 domains. It had the following content when last analyzed:


The DNS traffic showed a very suspicious spike on May 10, 2015 after small and consistent amounts of traffic, potentially indicating other campaigns or testing prior to this specific phishing campaign.


Compromised Hosting

bankruptcylawyershawaii[.]net

bankruptcylawyershawaii[.]net appears to be a legitimate website, but was compromised at some point and used in an attempt to harvest credentials with the following phishing page:


Looking at the HTML source of this page, we can see that clicking the ‘Verify’ button sends the credentials to the file weba-akp.php, which is stored locally on the website. This is standard behavior in most commodity phishing attacks in which the phish uses a compromised site. Often, credentials are then sent to an email address that’s configured statically in the PHP file or another file containing code designed to run on the server.
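Checking where a phishing form posts its data can be automated with Python’s standard HTML parser. This is a minimal sketch under the assumption that the page source has already been fetched; the class and function names are hypothetical. An action with no host part, like weba-akp.php, points back at the same site.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlsplit


class FormActionParser(HTMLParser):
    """Collect the action attribute of every <form> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.actions: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or empty action means the form posts to the page itself.
            self.actions.append(dict(attrs).get("action") or "")


def form_actions(html: str) -> List[str]:
    """Return the form targets found in the given HTML source."""
    parser = FormActionParser()
    parser.feed(html)
    return parser.actions


def is_local(action: str) -> bool:
    """True when the action has no host, so it resolves on the same site."""
    return urlsplit(action).netloc == ""
```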


The domain was registered on March 21, 2014 at godaddy[.]com. The whois data is not hidden as it was with the more targeted serviceyourpaypal[.]com.

The domain is using private nameservers provided by Hostgator. These nameservers are used by customers of Hostgator’s reseller, dedicated, and VPS hosting plans. The IP address this domain uses as its A record is hosting a total of 11 domains.

When viewing DNS requests, it’s impossible to miss the suspicious spike in traffic around April 18. That is most likely when this phishing campaign was active.


Free Hosting

upgrade2015a.wix[.]com

The next phish was located at the free hosting provider, wix[.]com. Anyone can use wix[.]com to host a free website. As of June 29, 2015, the following phishing page was online at upgrade2015a.wix[.]com:


Wix[.]com is hosted at GoDaddy and owned/administered by Incapsula. Incapsula had only 17 domains seen in phishing in this data set and wasn’t actually among the top 10 worst ASNs, but it’s a good example of free hosting being used in phishing.

Looking at the DNS requests for this subdomain, there is an obvious change in the requests which suggests this campaign started on June 26, 2015. There may have been some testing on June 23, when we see only a few requests.
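The spikes in these examples were spotted by eye on request graphs, but the same pattern can be flagged automatically from daily request counts. The sketch below compares each day against the median of the preceding days; the function name, window size, multiplier, and minimum count are all assumptions chosen for illustration, not values from this analysis.

```python
from statistics import median
from typing import List, Sequence


def find_spikes(
    daily_counts: Sequence[int],
    window: int = 7,
    factor: float = 5.0,
    min_count: int = 10,
) -> List[int]:
    """Return indexes of days whose request count is at least min_count
    and more than factor times the median of the previous window days."""
    spikes = []
    for i in range(window, len(daily_counts)):
        baseline = median(daily_counts[i - window:i])
        # Use a floor of 1 so a baseline of zero does not flag every request.
        if daily_counts[i] >= min_count and daily_counts[i] > factor * max(baseline, 1):
            spikes.append(i)
    return spikes
```

Using the median rather than the mean keeps one busy day from inflating the baseline for the days that follow, so a multi-day campaign is flagged on each of its days.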


Conclusion

Using just a small sample of reported phishing content, we can capture a fairly good picture of which hosting providers may be more vulnerable to compromise or more forgiving of malicious behavior. This information can be useful when considering where to host your website or online service. Additionally, just a quick analysis of data from Phishtank can be used to build a training set of indicators to look for when working to protect users across a network.