IV newsweekjapan.jp pagination problem

φ

A bit of background

The Web has the concept of a canonical URL.

The same page may be accessible via several different URLs, and the canonical URL identifies all of them as the same resource.

Web pages may specify their canonical URL using the canonical link element.

Automated tools, such as search engines or, in our case, the IV engine, use that information to deduplicate search results, indexes, etc.

When the IV engine sees a URL, it loads the content, determines the canonical URL, and then reloads the content from that canonical URL instead of the initial one. You can try it out by pasting a URL with additional parameters into the contest UI: you'll be redirected to the canonical URL for that page.

This behavior applies not only to initial page URLs but also to iframes. When you try to @inline an iframe, the IV engine gives you the contents of the canonical URL for the specified src.
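The resolution step described above can be sketched roughly as follows. This is not the IV engine's actual code, only a minimal illustration in Python of how a tool might read the canonical link element from a page; the names CanonicalLinkParser and canonical_url are made up for this example, and the assumption that a page without a canonical link keeps its own URL is ours.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are routed here as well.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(page_url: str, html: str) -> str:
    """Return the canonical URL declared in html, or page_url if none is declared."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return page_url  # assumption: no canonical link means the URL stands as-is
    # A relative href is resolved against the URL the page was loaded from.
    return urljoin(page_url, parser.canonical)
```

A tool following this scheme would then load the returned URL and work with that content, which is why everything hinges on the site declaring the canonical link correctly.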

The problem

The problem is that newsweekjapan.jp uses canonical link elements incorrectly: every page of a multi-page article includes a <link rel="canonical"> that points to the first page of the article.

Example:

Here's a multi-page article link:

https://www.newsweekjapan.jp/florent/2019/02/post.php

In the source code you can see:

<link
  rel="canonical"
  href="https://www.newsweekjapan.jp/florent/2019/02/post.php"
/>

Here's the second page of the article:

https://www.newsweekjapan.jp/florent/2019/02/post_2.php

And its canonical URL (the same as for the first page):

<link
  rel="canonical"
  href="https://www.newsweekjapan.jp/florent/2019/02/post.php"
/>

Now try the second-page URL in the contest UI. As mentioned above, you'll be redirected to https://www.newsweekjapan.jp/florent/2019/02/post.php. The same happens for any other page of the article.

For IV, this means that if someone shares a link to any page except the first via Telegram, the IV will *always be generated for the first page*. What's worse, it happens silently: neither the person who shared the link nor the one who opened it in IV will know that they are looking at content other than what they expected.
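To make the silent substitution concrete, here is a small hedged sketch that compares a shared URL with the canonical URL the page declares. The function name same_page and the rule it applies (same host and path, query string and fragment ignored) are assumptions chosen for illustration, not something the IV engine is known to do.

```python
from urllib.parse import urlsplit


def same_page(requested_url: str, canonical_url: str) -> bool:
    """True when both URLs name the same host and path.

    Query string and fragment are ignored, since extra parameters are the
    ordinary, harmless reason for a URL to differ from its canonical form.
    """
    requested, canonical = urlsplit(requested_url), urlsplit(canonical_url)
    return (requested.hostname, requested.path) == (canonical.hostname, canonical.path)


first_page = "https://www.newsweekjapan.jp/florent/2019/02/post.php"
second_page = "https://www.newsweekjapan.jp/florent/2019/02/post_2.php"

# The second page declares the first page as canonical, so the check fails:
# whoever opens the shared link gets different content from what was shared.
print(same_page(second_page, first_page))
```

With a correctly declared canonical link this check would pass for every page of the article, because each page would point at itself.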

The solution

This problem must not be ignored because, as stated in the IV goals and in the clarifications from the previous contest, generating an incorrect IV is worse than not generating one at all, and generating an incorrect IV that looks like a correct one is worse still.

Since the template is executed after the canonical URL has been resolved and loaded, and since we can't work around this with iframe inlining, the only solution is to not generate IV for articles with pagination.
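Because the template only ever sees the first page, any check for pagination has to rely on the first page's own content. The sketch below shows one possible heuristic in Python rather than in the IV template language. It assumes that a multi-page article links to its later pages and that those pages follow the <name>_<n>.php naming seen in the example above; the site's real pagination markup is not described here, and the names HrefCollector and links_to_later_pages are hypothetical.

```python
import posixpath
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class HrefCollector(HTMLParser):
    """Collects the href of every <a> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to_later_pages(page_url: str, html: str) -> bool:
    """True if html links to a sibling such as post_2.php for post.php.

    Assumption: later pages live in the same directory and are named
    <stem>_<number><extension>, as in the example article.
    """
    page = urlsplit(page_url)
    directory, filename = posixpath.split(page.path)
    stem, extension = posixpath.splitext(filename)
    later_page = re.compile(re.escape(stem) + r"_\d+" + re.escape(extension) + r"\Z")

    collector = HrefCollector()
    collector.feed(html)
    for href in collector.hrefs:
        target = urlsplit(urljoin(page_url, href))  # resolve relative links
        target_directory, target_name = posixpath.split(target.path)
        if (
            target.hostname == page.hostname
            and target_directory == directory
            and later_page.match(target_name)
        ):
            return True
    return False
```

A page for which this returns True would be left without IV. A real template would express the equivalent condition against whatever pagination element the site actually renders.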

The website has enough articles without pagination for IV to be useful anyway. Although the problem is not on the IV engine side, if Telegram ever finds a way to resolve this issue, we'll need to update the template to cover more pages. Until then, we must stick to the ones we can meaningfully cover.
