Getting the Links in an HTML Document

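import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// This method takes an absolute URI, which can be either a local file (e.g. file:///c:/dir/file.html)
// or a URL (e.g. http://host.com/page.html), and returns all HREF links in the document.
// If the document cannot be read, only the links found so far (possibly none) are returned.
public static String[] getLinks(String uriStr) {
    List<String> result = new ArrayList<>();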

    try {
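        // Open a connection to the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();

        // Parse the HTML
        EditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
        // Without this property, a <meta> tag that declares a charset makes the
        // parser throw ChangedCharSetException (an IOException) and no links are found
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        // Close the reader once parsing is done
        try (Reader rd = new InputStreamReader(conn.getInputStream())) {
            kit.read(rd, doc, 0);
        }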

        // Find all the A elements in the HTML document
        HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
        while (it.isValid()) {
            // getAttributes() is declared to return AttributeSet, so no downcast is needed
            AttributeSet s = it.getAttributes();

            String link = (String)s.getAttribute(HTML.Attribute.HREF);
            if (link != null) {
                // Add the link to the result list
                result.add(link);
            }
            it.next();
        }
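    } catch (URISyntaxException | BadLocationException | IOException e) {
        // MalformedURLException is an IOException, so it is handled here too.
        // Report the failure instead of silently swallowing it.
        System.err.println("Could not read links from " + uriStr + ": " + e);
    }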

    // Return all found links
    return (String[])result.toArray(new String[result.size()]);
}
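
To try the parsing steps without a network connection or a file on disk, the same approach can be pointed at an in-memory string. The following is only a minimal sketch under stated assumptions: the class name LinkDemo, the helper linksFromHtml, and the sample HTML are hypothetical and not part of the original example. It also sets the IgnoreCharsetDirective property suggested in the comments below.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class LinkDemo {
    // Hypothetical helper: same parsing steps as getLinks(), but reading from a string
    static List<String> linksFromHtml(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Do not abort parsing if the document declares its own charset
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);

        // Walk all A elements and keep those that carry an HREF attribute
        List<String> links = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document: two links and one named anchor without an HREF
        String html = "<html><body><p><a href=\"http://host.com/a.html\">A</a> "
                    + "<a name=\"anchor\">no href</a> "
                    + "<a href=\"b.html\">B</a></p></body></html>";
        for (String link : linksFromHtml(html)) {
            System.out.println(link);
        }
    }
}
```

Links are returned exactly as written in the document, so a relative link such as b.html stays relative; resolving it against the page address is left to the caller.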

Comments

18 Jan 2011 - 5:07pm by d34d_d3v1l (not verified)

tested, but didn't work...

17 May 2011 - 2:36pm by Vic (not verified)

Lame !
A basic pattern does the same.. even better...
Why should I bother using this spaghetti?

12 Oct 2011 - 7:25am by XooX (not verified)

@Vic

LOL. What basic pattern? I don't know of any basic pattern that can parse HTML links.

24 Dec 2011 - 1:56am by Solyn (not verified)

Thank you so much for this article, it saved me time!

11 May 2012 - 7:05pm by Anonymous (not verified)

If it doesn't work:
Right after this line: HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
add this one:
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

That should do it!
