Unidentifiable “Related articles”

Eaxon

Comment to the issue: https://instantview.telegram.org/contest/moreechampion.com.au/template5/issue3/


It’s not possible to reliably handle 100% of similar cases, and trying to do so may eventually result in lost content or broken markup.


1.

The website’s writers may use “RELATED”, “SEE ALSO”, “ALSO MAKING NEWS”, “MORE ABOUT %s” and many other titles for these hardcoded blocks. They may also simply change the case: “Related story:”, “On this topic:”, “Continue reading:”, etc. You would need dozens of ‘contains’ functions to support all of them, and you would still miss some.
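To make the point concrete, here is a minimal Python sketch of the same kind of case-sensitive substring matching that ‘contains’ performs. The list KNOWN_TITLES and the function name are hypothetical, chosen only for illustration; a real template would use IV rules rather than Python.

```python
# Hypothetical list of headings a template author might hard-code.
KNOWN_TITLES = ["RELATED", "SEE ALSO", "READ MORE"]


def looks_like_related_title(text: str) -> bool:
    """Case-sensitive substring check, similar in spirit to XPath contains()."""
    return any(title in text for title in KNOWN_TITLES)


# Headings quoted in the comment above.
samples = [
    "RELATED ARTICLES:",
    "Related story:",
    "ALSO MAKING NEWS",
    "MORE ABOUT MOREE SENIORS FESTIVAL",
]

# Every heading the hard-coded list fails to recognise.
missed = [s for s in samples if not looks_like_related_title(s)]
print(missed)
```

Three of the four sample headings slip through, and each fix means one more entry in the list.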


Just a few pages where your template does not format related articles (links are pointing to the template):

Of course, you can add support for 3 of these, but would you really add support for “MORE ABOUT MOREE SENIORS FESTIVAL”?

Also note that the markup differs too: while most of the titles are placed in a <b> tag, the “RELATED STORY” title is placed in the <a> tag along with the link’s text, so you have to enter it manually (or use @clone).


2.

2.1. What’s more crucial is that the markup may look like this (Trust me! It may! I’ve seen a lot of… markups during the contest):

<div class="assets">
  <p>However in the 12 months to September 2017, all but two of the 17 crime categories saw an increase in Moree.</p>
  <p><b>RELATED ARTICLES:</b><br></p>
</div>
<div class="assets">
  <ul>
    <li><a href="http://example.com">Link #1</a><br></li>
    <li><a href="http://example.org">Link #2</a><br></li>
  </ul>
  Domestic violence-related assaults and indecent assaults were the only categories which recorded a drop over the 12-month period, down by 18 and 19 incidents respectively.
</div>

In this case, we must first find the p with the title (using a lot of ‘contains’), then check whether it’s the last element in its div and, if it is, @after_el it; then check whether the ul with links is the first element in its div and, if it is, @before_el it; and only then @combine. Otherwise, two texts (“However in the 12 months…” and “Domestic violence-related assaults…”) will be lost.
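The checks described above can be sketched in Python on a simplified model of the page. This is only an illustration under stated assumptions: the (kind, text) node tuples, the kind names and the function merge_related are all hypothetical and are not IV syntax.

```python
from typing import List, Tuple

Node = Tuple[str, str]  # (kind, text); hypothetical kinds: "text", "title", "links"


def merge_related(divs: List[List[Node]]) -> List[Node]:
    """Flatten divs, joining a trailing title with a leading link list."""
    out: List[Node] = []
    i = 0
    while i < len(divs):
        div = divs[i]
        nxt = divs[i + 1] if i + 1 < len(divs) else []
        title_is_last = bool(div) and div[-1][0] == "title"
        links_are_first = bool(nxt) and nxt[0][0] == "links"
        if title_is_last and links_are_first:
            out.extend(div[:-1])  # keep the text before the title
            out.append(("related", div[-1][1] + " " + nxt[0][1]))
            out.extend(nxt[1:])   # keep the text after the link list
            i += 2
        else:
            out.extend(div)
            i += 1
    return out


page = [
    [("text", "However in the 12 months..."), ("title", "RELATED ARTICLES:")],
    [("links", "Link #1, Link #2"), ("text", "Domestic violence-related assaults...")],
]
result = merge_related(page)
print(result)
```

Both surrounding texts survive only because the title and the list are cut out of their divs before being joined; converting either div as a whole would drop one of them.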


2.2. A more likely case:

<div class="assets">
  <p>
    <b>READ MORE: </b>
    <a href="http://example.com">Link</a><br>
    The junior champion was Tyler Stolzenberg, while Chloe Gillogly was awarded the junior encouragement award.
  </p>
</div>

Here it’s risky to convert the whole p into related, as it may contain other content. So we would have to use [contains(text(), "READ ABOUT THIS") or contains(text(), "READ ABOUT THAT") … or contains(text(), "READ ABOUT PLUTO")] again, then wrap the match in related and @combine.
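As a rough Python illustration of why only part of the paragraph can be converted, the sketch below pulls out the title and the link and leaves the remaining sentence in place. The regular expression, the function split_related and the shortened sample paragraph are assumptions made for this example, and regex-based HTML handling is used purely for brevity.

```python
import re

# Shortened version of the paragraph from case 2.2.
PARAGRAPH = (
    '<p><b>READ MORE: </b>'
    '<a href="http://example.com">Link</a><br>'
    'The junior champion was Tyler Stolzenberg.</p>'
)

# A bold title, then a single link, then an optional line break.
PATTERN = re.compile(
    r'<b>(?P<title>[^<]*)</b>\s*(?P<link><a\b[^>]*>[^<]*</a>)\s*(?:<br>)?'
)


def split_related(html, known_titles):
    """Return (related_link, remaining_html); related_link is None if no match."""
    m = PATTERN.search(html)
    if not m or not any(t in m.group("title") for t in known_titles):
        return None, html
    rest = html[:m.start()] + html[m.end():]
    return m.group("link"), rest


related, rest = split_related(PARAGRAPH, ["READ MORE"])
print(related)
print(rest)
```

The title check is still required here, so the dependence on a hard-coded list of headings does not go away.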


We also can hardly avoid the ‘contains’ function, because the markup may look like this:

<p>
  <b>View other items on her website: </b>
  <a href="http://example.com">Link</a><br>
</p>
<div>
  <p><b>TOP 3 ONLINE POTATO SHOPS:</b><br></p>
</div>
<div>
  <ul>
    <li><a href="http://potatostore.com">Potato Store</a></li>
    <li><a href="http://play.potato.com">Potato Play</a></li>
    <li><a href="http://superpotato.com">Super-Duper Potato</a></li>
  </ul>
</div>

In cases like this, if such blocks were converted to related without checking the title, all links for which IV cannot be generated would be lost.
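A small Python model of that loss, assuming (as the comment states) that a related block keeps only links for which an IV page can be generated. The set IV_DOMAINS and the function render_related are hypothetical stand-ins, not part of any real API.

```python
from urllib.parse import urlparse

# Hypothetical: domains for which an Instant View page can be generated.
IV_DOMAINS = {"moreechampion.com.au"}


def render_related(links):
    """Keep only the links that could be shown as IV pages."""
    return [url for url in links if urlparse(url).hostname in IV_DOMAINS]


# Links from the "TOP 3 ONLINE POTATO SHOPS" list above.
potato_links = [
    "http://potatostore.com",
    "http://play.potato.com",
    "http://superpotato.com",
]

surviving = render_related(potato_links)
print(surviving)
```

None of the external shop links survive, so a list that was ordinary article content disappears entirely.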


By the way, I once got an error message when trying to use 3 (three) ‘has-class’ functions on one tag, so I had to split that statement into 3. This means we may eventually be forced to search for links about Moree Seniors Festival and links about Pluto separately… And after all that, we would still miss one more block whose title is still undiscovered.


So is it worth formatting only some of these “Related articles” blocks, forcing the IV engine to do lots of unnecessary work, at the risk of losing content?

