I needed a regular expression that would match HTTP links in HTML documents, and this is what I came up with:
<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>
It will match anything that meets all of these conditions:
- it is an <a …>…</a> tag
- the tag has an href attribute
- the value of href has matching quotes or no quotes
- the value of href starts with http:// (my requirement)
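The interesting part is (?=\1), which checks that the next character is the same quotation mark (' or ") that the value started with, without consuming it. This construct is called a positive lookahead with a backreference. One caveat: when the value has no quotes, group 1 never participates in the match, and in several engines (Python's re among them) a backreference to such a group simply fails, so whether the unquoted case matches depends on the engine.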
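To see the lookahead with backreference in isolation, here is a minimal sketch in Python. The pattern QUOTED and the helper quoted_value are hypothetical names made up for this illustration, not part of the regex above, and the behaviour shown is that of Python's re module.

```python
import re

# Hypothetical, simplified pattern: capture the opening quote in group 1,
# then lazily take characters until the next character is that same quote.
QUOTED = re.compile(r'''(['"])(.*?)(?=\1)''')

def quoted_value(text):
    """Return the text between the first pair of matching quotes, or None."""
    m = QUOTED.search(text)
    return m.group(2) if m else None

# The apostrophe inside the value does not end it, because the value was
# opened with a double quote and \1 refers to that exact character.
print(quoted_value('''href="it's here"'''))   # it's here

# Same idea with the quote styles swapped.
print(quoted_value("""href='say "hi"'"""))    # say "hi"

# With an optional quote group that never participates, Python's re
# treats \1 as a failed match, so nothing is found.
print(re.search(r'''(['"])?(\w+)(?=\1)''', 'abc'))  # None
```

The lookahead only checks for the closing quote; it does not consume it, which is why the captured value never includes the quote itself.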
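The other interesting part is \s*([^<]+|.*?)?\s*, which matches the linked text. That text can have surrounding whitespace and new lines, and the idea is to keep them out of the captured group for easier reading. The leading \s* does this for whitespace on the left. On the right, however, the greedy [^<]+ takes any trailing whitespace before the final \s* gets a chance, so the captured text may still need trimming at the end.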
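As a rough check of the whole expression, the sketch below runs it through Python's re module. The names LINK, extract_link and sample, and the example.com URL, are assumptions made up for this illustration; other regex engines may treat the unquoted case differently.

```python
import re

# The pattern from the post, unchanged. A raw triple-quoted string lets it
# contain both ' and " without extra escaping.
LINK = re.compile(r'''<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>''')

def extract_link(html):
    """Return (url, text) for the first matching link, or None."""
    m = LINK.search(html)
    return (m.group(2), m.group(3)) if m else None

# Hypothetical input: link text with a newline and spaces around it.
sample = '<a class="ext" href="http://example.com/page">\n   Example link </a>'

url, text = extract_link(sample)
print(url)           # http://example.com/page
print(repr(text))    # 'Example link ' (leading whitespace gone, trailing space kept)
print(text.strip())  # Example link

# https does not satisfy the http:// requirement.
print(extract_link('<a href="https://example.com">x</a>'))  # None

# In Python's re, an unquoted href value does not match, because \1 refers
# to a group that never participated.
print(extract_link('<a href=http://example.com/>x</a>'))    # None
```

Calling strip() on the captured text is the simplest way to drop the trailing whitespace. Making the first alternative lazy ([^<]+?) should have the same effect inside the regex itself.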