Matching links in HTML with a regular expression

I needed a regular expression that would match HTTP links in HTML documents, and this is what I came up with:

<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>

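It will match any text where:

    • there is an <a …>…</a> tag pair
    • the tag has an “href” attribute
    • the href value is wrapped in matching quotes; the quote is optional in the pattern, but whether unquoted values match depends on how the regex engine treats a backreference to an unmatched group
    • the href value starts with http:// (my requirement)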
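
To try the requirements above, here is a minimal sketch in Python's re module. The names LINK_RE and find_link and the example.com URLs are made up for illustration; only the pattern comes from the post. One thing to be aware of: in Python a backreference to a group that did not take part in the match never succeeds, so the unquoted form is not matched here. Other engines may behave differently.

```python
import re

# The pattern from the post, unchanged. Group 1 is the opening quote,
# group 2 the URL, group 3 the link text.
LINK_RE = re.compile(r"""<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>""")


def find_link(html):
    """Return (url, text) for the first match in html, or None if nothing matches."""
    m = LINK_RE.search(html)
    if m is None:
        return None
    return m.group(2), m.group(3)


# Matching double or single quotes around the href value both work.
double_quoted = find_link('<a class="ext" href="http://example.com/page">Example</a>')
single_quoted = find_link("<a href='http://example.com/page'>Example</a>")

# Only http:// is accepted, so an https link is rejected.
https_link = find_link('<a href="https://example.com/page">Example</a>')

# Python-specific: group 1 is unset, so (?=\1) can never succeed.
unquoted = find_link('<a href=http://example.com/page>Example</a>')

print(double_quoted)  # ('http://example.com/page', 'Example')
print(single_quoted)  # ('http://example.com/page', 'Example')
print(https_link)     # None
print(unquoted)       # None
```

Because both .* parts are greedy, the pattern is best applied to input with one link per line; several links on the same line can be merged into a single match.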

The interesting part is (?=\1), which checks that the next character is the same quotation mark (' or ") that opened the value, without consuming it. This construct is a positive lookahead containing a backreference.
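
The lookahead with backreference can be seen in isolation with a reduced pattern. This is only a sketch: QUOTED_RE and the sample string are hypothetical and not part of the original regex.

```python
import re

# Hypothetical reduced pattern: an opening quote (group 1), a lazy body
# (group 2), and a lookahead requiring the same quote to come next.
QUOTED_RE = re.compile(r"""(['"])(.*?)(?=\1)""")

sample = '''say "it's here" now'''
m = QUOTED_RE.search(sample)

opening_quote = m.group(1)   # the double quote that opened the value
body = m.group(2)            # the apostrophe inside does not end the body
next_char = sample[m.end()]  # the closing quote is asserted, not consumed

print(opening_quote, body, next_char)
```

The match ends just before the closing quote, which is why the quote never ends up inside the captured URL in the full pattern.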

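The other interesting part is \s*([^<]+|.*?)?\s*, which matches the link text. The text may be surrounded by whitespace and newlines, and the leading \s* keeps the leading whitespace out of the captured group. Note that [^<]+ is greedy and also matches whitespace, so trailing whitespace can still end up in the capture and is best trimmed afterwards.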
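
A short Python sketch of how the link text is captured, assuming a made-up example.com link with text spread over several lines. It shows the leading whitespace being skipped while trailing whitespace stays in the group until it is trimmed.

```python
import re

# The pattern from the post, unchanged; group 3 is the link text.
LINK_RE = re.compile(r"""<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>""")

html = '<a href="http://example.com/">\n   Example site  \n</a>'
m = LINK_RE.search(html)

raw_text = m.group(3)          # leading whitespace skipped, trailing kept
clean_text = raw_text.strip()  # trim what the greedy [^<]+ picked up

print(repr(raw_text))    # 'Example site  \n'
print(repr(clean_text))  # 'Example site'
```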


Author: Anonymous, published on the SloDug.si portal (Archive)
