IV image links

The Checklist has a rule about image links:

In IV 2.0, <img> tags support the optional attribute href to make the image clickable. This should only be used if the link behind the image leads to some different page or content. E.g., if the link opens the same image in a higher resolution, it should be ignored.

To satisfy this rule, we need to preserve image links that lead to other content while filtering out direct image links.

That all works great when a website wraps images in links of only one of those kinds, or when markup is different between the two.

But often we find ourselves in a situation where there is no difference in HTML structure or CSS classes to tell us whether we should set the href attribute or not. Basically, all we have to work with is the values of the href and src attributes.

So, what can we do?

First let's try to see what that rule really means:

  • Why do we need image links? Because they carry information for the reader. If we remove them, the information is lost, which is a critical issue.
  • Why shouldn't we include direct image links? Because they hurt the UX. If we do not remove them, an icon on top of the image will suggest to the reader that there is something behind the link, when in fact it is the same image. That is still a serious issue, but if we had to choose which one is worse, it would be missing content.

So, when it's not really possible to distinguish between useless and useful links, it's better to include a link than remove it.

But we shouldn't give up on the problem just yet and set href unconditionally. Let's look at the options:

  1. Do not set href if it contains/ends with a known image file extension.
    This is very fragile logic, because it's perfectly valid for a URL to contain (even at the end) any file extension. For example, here's an image from Wikipedia:
    It's a web page with quite a bit more content than just an image, but its URL ends with ".jpg". So this option will produce false positives, which in turn lead to missing content.
  2. Do not set href if it matches the src (or one of the srcset values).
    This one is better because it's backed up by the page source code. In practice it can result in false negatives, because the same image can be accessible via different URLs. Still, it's better than the first option, because a leftover direct image link is a less severe issue than missing content.
  3. Do not set href if the path or the filename (the portion that comes after the final forward slash in the URL) is the same for href and src.
    This option might be a good balance between the previous two. It filters out more direct image links (even on different domains) while giving a pretty low probability of false positives. Still, it's better to do more page checking before going with this option.
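
The three options above can be sketched in plain Python to compare how they behave on the same inputs. This is only an illustration of the comparison logic, not IV rule syntax; the function names, the extension list, and the example.org URLs are all assumptions made up for this sketch, and the srcset parsing is deliberately naive (it splits on commas).

```python
from posixpath import basename
from urllib.parse import urljoin, urlsplit

# Assumed list of extensions; extend as needed for a given site.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def srcset_urls(srcset):
    """Return the URLs from a srcset value, dropping width/density descriptors.

    Naive: assumes the URLs themselves contain no commas.
    """
    return [part.split()[0] for part in srcset.split(",") if part.strip()]


def looks_like_image(href):
    """Option 1: the href path ends with a known image file extension."""
    return urlsplit(href).path.lower().endswith(IMAGE_EXTENSIONS)


def matches_source(href, src, srcset="", base=""):
    """Option 2: href resolves to the same URL as src or a srcset candidate."""
    target = urljoin(base, href)
    candidates = [src] + srcset_urls(srcset)
    return any(urljoin(base, candidate) == target for candidate in candidates)


def same_filename(href, src):
    """Option 3: href and src share the portion after the final slash."""
    name = basename(urlsplit(href).path)
    return bool(name) and name == basename(urlsplit(src).path)


def keep_href(href, src, srcset="", base="", compare_filenames=False):
    """Keep the link unless option 2 (and optionally option 3) flags it."""
    if matches_source(href, src, srcset, base):
        return False
    if compare_filenames and same_filename(href, src):
        return False
    return True
```

For a hypothetical page at https://example.org/post/1, a link to https://cdn.example.org/full/a.jpg around an image with src https://example.org/img/a.jpg is kept by option 2 alone (a false negative) and dropped once compare_filenames is enabled. A content page such as https://example.org/wiki/File:Photo.jpg is flagged by looks_like_image, which is exactly the false positive that makes option 1 risky, so keep_href does not use it.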