Combatting E-mail Fraud: The Phishing Net

"Phishing" is the term coined by hackers for attempts to lure personal information out of people by persuading them to visit web sites that look like genuine bank, credit card, or payment sites but are actually sophisticated fakes of those sites.

This page describes roughly how the phishing net works. The process is fairly complicated, so the description is necessarily incomplete.

Many of the items listed below handle "obfuscations" (attempts to disguise the real text) of text and URLs. These include swapping letters around, using letters that look very similar to other letters, using ";" instead of ":", using "," instead of ".", and many similar tricks. I have tried to highlight which rules handle obfuscations, but I have not given the details of exactly what each rule will accept. Many variations on the expected text will be detected.
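As an illustration only, two of the obfuscations named above (";" in place of ":" and "," in place of ".") could be tolerated with something like the following Python sketch. The pattern and the function names strip_scheme and fix_dots are hypothetical; the real rules accept many more variations than this.

```python
import re

# Hypothetical pattern: "http", "https" or "ftp", then ":" or ";",
# then one or more "/" or "\" characters. Not the real rule set.
SCHEME_RE = re.compile(r"^(?:https?|ftp)[:;][/\\]+", re.IGNORECASE)


def strip_scheme(url: str) -> str:
    """Remove a leading scheme, tolerating ';' used in place of ':'."""
    return SCHEME_RE.sub("", url)


def fix_dots(host: str) -> str:
    """Treat ',' as an obfuscated '.' in a host name."""
    return host.replace(",", ".")
```

With these definitions, strip_scheme("http;//www.example.com") and fix_dots("www,example,com") both reduce to the plain host name.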

    Keep track of all <BASE> tags as they provide a root URL for every relative link on the page.
    Attach the <BASE> URL onto the front of all relative URLs contained in every link on the page.
    Look for links contained in imagemaps. An imagemap may sit inside a link to a safe site and show an image of the safe site's name, yet define a rectangle whose link destination is a fraud site. Handle these by removing imagemaps, so that the real destination of the link is used instead of the apparent destination.
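The BASE handling in the first two items can be sketched in Python with the standard library. This is a minimal illustration, not the phishing net's own code: the class name LinkCollector is hypothetical, and imagemaps are not handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect link destinations, resolving relative ones against <BASE>."""

    def __init__(self):
        super().__init__()
        self.base = None   # most recent <BASE HREF=...> seen, if any
        self.links = []    # resolved destinations of <A HREF=...> links

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base":
            self.base = href
        elif tag == "a":
            # urljoin leaves absolute URLs alone and resolves relative ones.
            self.links.append(urljoin(self.base, href) if self.base else href)
```

Feeding it a page whose BASE is "http://fraud.example.net/login/" and which links to "verify.html" yields the full destination "http://fraud.example.net/login/verify.html", which is what the later checks need to see.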
     
     
    Real or apparent
    destination       Operation
    ----------------  ---------
    apparent          Convert to lower case.
    apparent          Allow for links that look like Microsoft's ADO.Net, ASP.Net and other .Net functionality.
    apparent          Remove %a0 encoded characters (hard space).
    apparent          Decode all %-encoded characters.
    apparent          Remove all white space.
    apparent          Remove all leading numbers in square brackets.
    apparent          Change any \ to / as many browsers do this quietly to help Windows authors.
    apparent          Remove all HTML tags.
    apparent          Remove the username part of email addresses.
    apparent          Remove all &-encoded symbols such as < and >.
    apparent          Remove leading &lt;.
    apparent          Remove trailing &gt;.
    apparent          Convert all & characters to their international equivalent.
    real              Convert to lower case.
    real              Remove %a0 encoded characters (hard space).
    real              Decode all %-encoded characters.
    real              Force "safe" result if it does not contain either a . or a /.
    real              Remove all white space.
    real              Change all \ to / as many browsers do this quietly.
    real              Force "safe" result if it is an email address.
    real              Remove trailing dots and commas and other punctuation.
    real              Remove leading [numbers].
    real              Remove all HTML tags.
    real              Remove "blocked::" labels as inserted by some other products.
    real              Remove "outbind://" labels as inserted by some other products.
    real              Insert the BASE URL if the link is relative and the BASE URL is defined.
    real              Remove any leading http:// or ftp:// or obfuscations of those, including replacing the : with a ;.
    real              Force "safe" result if it is a mailto: link.
    real              Remove everything after the first / or ?.
    real              Remove any trailing br, p or ul tags.
    real              Force "safe" result if it is a file: link.
    real              Force "safe" result if it is a link to somewhere else in the same page (internal link).
    real              Remove any trailing /.
    real              Force "dangerous" result if URL contains any non-printable-ASCII characters.
    real              Identify JavaScript links.
    apparent          Continue searching if any of these are true:
                      1. it starts with the letters usually used at the start of a website name, e.g. www, ftp and any mis-spellings or transpositions of these,
                      2. it ends with .com, .org, .net, .info, .biz, .ws or other strings which appear to look like this,
                      3. it ends with .com or .co followed by a 2-letter country code,
                      4. it starts with http: or ftp: or mailto: or any mis-spellings or obfuscated versions of these,
                      5. you are looking for numeric IP addresses (Phishing By Numbers) and the link contains no <, > or g-z characters.
    apparent          Remove leading strings that look like http:, ftp:, mailto: and other obfuscations of these.
    apparent          Remove everything after the first /.
    apparent          Remove all trailing . characters (and obfuscations).
    apparent          Add www. on the front unless it already starts with www, ftp, mailto or obfuscations of these.
    real              Force a "dangerous" result if Phishing By Numbers is enabled and the link is numeric (IPv4 and IPv6).
    both              Compare the apparent destination with the real destination, with an optional www on the front.
     
     
    If they do not match, and the real address is not in the Phishing Safe Sites file, trigger a "dangerous" result.
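    To make the overall flow concrete, here is a heavily simplified Python sketch of a few of the operations above followed by the final comparison. It is an assumption-laden illustration, not the actual implementation: SAFE_SITES stands in for the Phishing Safe Sites file, the names host_of, strip_www and check_link are hypothetical, most obfuscation handling and the Phishing By Numbers check are left out, and treating link text that does not look like a site name as "safe" is my reading of the "continue searching" step.

```python
import re
from urllib.parse import unquote

# Hypothetical stand-in for the Phishing Safe Sites file.
SAFE_SITES = {"click.example.org"}

# Simplified "looks like a website name" test for the apparent destination.
LOOKS_LIKE_SITE = re.compile(
    r"^(?:www|ftp)\.|\.(?:com|org|net|info|biz|ws)$|\.com?\.[a-z]{2}$"
)


def host_of(text: str) -> str:
    """Apply a subset of the clean-up operations and return the host part."""
    text = text.lower().replace("%a0", "")              # lower case, drop hard spaces
    text = unquote(text)                                # decode %-encoded characters
    text = re.sub(r"\s+", "", text)                     # remove all white space
    text = text.replace("\\", "/")                      # \ becomes /
    text = re.sub(r"<[^>]*>", "", text)                 # remove HTML tags
    text = re.sub(r"^(?:https?|ftp)[:;]/+", "", text)   # scheme, allowing ";" for ":"
    text = re.split(r"[/?]", text, maxsplit=1)[0]       # drop everything after / or ?
    return text.rstrip(".,")                            # trailing dots and commas


def strip_www(host: str) -> str:
    """Make a leading "www." optional for the comparison."""
    return host[4:] if host.startswith("www.") else host


def check_link(apparent: str, real: str) -> str:
    """Return "safe" or "dangerous" for one link (sketch only)."""
    if real.lower().startswith(("mailto:", "file:", "#")):
        return "safe"
    real_host = host_of(real)
    if any(not (32 < ord(ch) < 127) for ch in real_host):
        return "dangerous"                              # non-printable-ASCII in URL
    apparent_host = host_of(apparent)
    if not LOOKS_LIKE_SITE.search(apparent_host):
        return "safe"                                   # assumption: nothing to compare
    if strip_www(apparent_host) == strip_www(real_host):
        return "safe"
    return "safe" if real_host in SAFE_SITES else "dangerous"
```

    In this sketch, link text of "www.mybank.com" pointing at "http://www.mybank.com/login" is safe, while the same text pointing at "http://203.0.113.7/login" is dangerous because the two hosts differ and the real host is not in the safe list.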

    Less Strict Phishing Net

    The less strict phishing net follows the same process as above, except that it has a list of all the "generic" domains in every country around the world, such as ".com", ".co.uk", ".mil.es" and so on.

    It chops the generic domain off the end, and uses the last remaining element as the name of the company or organisation owning the domain. If the displayed URL contains the same organisation name as the real URL then the result is considered to be a safe link.

    So, for example, a displayed URL of "http://www.mycompany.co.uk" with a real URL of "http://tracker.mycompany.co.uk" would be considered safe, but the same displayed URL with a real URL of "http://www.othercompany.co.uk" would be considered dangerous, and would be highlighted.
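    The organisation-name comparison can be sketched in Python as follows. This is an illustration under stated assumptions: GENERIC_DOMAINS is a tiny hypothetical sample of the real list, the function names are made up, and both inputs are assumed to have been reduced to bare host names already.

```python
# Hypothetical, very short sample; the real list covers every country.
GENERIC_DOMAINS = (".co.uk", ".mil.es", ".com", ".org", ".net")


def organisation(host: str) -> str:
    """Chop the generic domain off the end and return the last remaining element."""
    host = host.lower().rstrip(".")
    # Try the longest suffixes first so the most specific generic domain wins.
    for suffix in sorted(GENERIC_DOMAINS, key=len, reverse=True):
        if host.endswith(suffix):
            host = host[: -len(suffix)]
            break
    return host.rsplit(".", 1)[-1]


def same_organisation(apparent_host: str, real_host: str) -> bool:
    """True if both hosts belong to the same organisation name."""
    return organisation(apparent_host) == organisation(real_host)
```

    Here "www.mycompany.co.uk" and "tracker.mycompany.co.uk" both reduce to "mycompany" and so match, while "www.othercompany.co.uk" reduces to "othercompany" and does not.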

    The result is slightly less strict checking, but far fewer false alarms caused by companies that like to monitor exactly who clicks on what by using multiple servers.

    If you are running an ISP, I strongly recommend that you run the "less strict" phishing net.