Excerpt word count: how does it work?

Hi,

I’m using $page->text()->excerpt(100, 'words') on a field, and it only partially works: the behaviour is strange.

Words are sometimes cut in the middle (which is precisely what I’m trying to avoid by using the “words” method instead of “characters”), and on some fields a 10-word limit cuts at 7 words, for example.

How does this function work exactly? What is counted and what is not?

Thank you!

Kirby uses PHP’s native str_word_count() function internally. It doesn’t just split on whitespace: according to the PHP docs, the string fri3nd, for example, becomes fri and nd.
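As a quick illustration of that splitting behaviour, here is a minimal sketch that calls str_word_count() directly. It does not go through Kirby at all; the sample string is only there for the demonstration.

```php
<?php
// Format 1 makes str_word_count() return the words it found as an array.
$words = str_word_count('Hello fri3nd', 1);

// The digit 3 is not a letter, so it acts as a separator:
// the result contains Hello, fri and nd.
print_r($words);
```

So a single token like fri3nd already counts as two words.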

Could you please post a few examples where it breaks for you?

Unfortunately I can’t, as it’s a confidential project for the moment. But the words are just plain alphabetic characters, without accents or diacritics (like architect), and I can’t figure out what the problem is.

OK. But then maybe you can create similar nonsense strings that also fail?

OK, I picked some dummy text from Wikipedia and the problem is still there:

Raw text:

Donkey Kong Country (DKC, Super Donkey Kong au Japon) est un jeu de plates-formes développé par Rare et édité par Nintendo à partir de novembre en 1994 sur Super Nintendo. Premier jeu de la série Donkey Kong Country, il met en scène les personnages de Donkey Kong et de Diddy Kong. Le jeu est adapté sur Game Boy en juin 1995 sous le titre Donkey Kong Land. Il est porté sur Game Boy Color en 2000, puis en 2003 sur Game Boy Advance. Le jeu est disponible dans sa version originale en téléchargement sur la console virtuelle de la Wii depuis décembre 2006, celle de la Wii U en 2014 et celle de la New Nintendo 3DS en 2016.

L’intrigue prend place sur l’île Donkey Kong Island sur laquelle vit la famille Kong. Durant la nuit, les Kremlings dérobent le stock de bananes de Donkey Kong, que son neveu Diddy Kong est chargé de protéger. Donkey et Diddy Kong, qui partent à sa recherche, parcourent les diverses zones de l’île (la jungle, un lagon, des cimes d’arbres, des grottes, un temple, des montagnes enneigées) et s’opposent aux Kremlings jusqu’au galion du Gang-Planche, où ils affrontent le chef King K. Rool afin de récupérer les bananes lâchement volées.

Cut like this (50 words):

Donkey Kong Country (DKC, Super Donkey Kong au Japon) est un jeu de plates-formes développé par Rare et édité par Nintendo à partir de novembre en 1994 sur Super Nintendo. Premier jeu de la série Donkey Kong Country, il met en scène les personnages de Donkey K…

I can understand the “bad” word count (at ~48 words) with some problematic characters but I can’t understand the cut in the middle of a standard word.

Yep I was doing it when you posted.

Also, the last space is sometimes included and sometimes not, which is a problem because the ellipsis sometimes sits right next to the last word and sometimes comes after a space.

It’s strange; when I count the words at http://php.fnlist.com/string/str_word_count I get three more words (assuming that K… is not a word here).

Part of the problem is that the word count goes wrong whenever a character with diacritics appears, e.g. scène. The second, related problem is that the positions of the words in the string go wrong as well. That explains why the cut sometimes ends in an empty space and sometimes in a few stray letters.

This is part of the array you get if you do a var_dump() after line 116 of helpers.php:

array (size=117)
  0 => string 'Donkey' (length=6)
  7 => string 'Kong' (length=4)
  12 => string 'Country' (length=7)
  21 => string 'DKC' (length=3)
  26 => string 'Super' (length=5)
  32 => string 'Donkey' (length=6)
  39 => string 'Kong' (length=4)
  44 => string 'au' (length=2)
  47 => string 'Japon' (length=5)
  54 => string 'est' (length=3)
  58 => string 'un' (length=2)
  61 => string 'jeu' (length=3)
  65 => string 'de' (length=2)
  68 => string 'plates-formes' (length=13)
  82 => string 'd' (length=1)
  85 => string 'velopp' (length=6)
  94 => string 'par' (length=3)
  98 => string 'Rare' (length=4)
  103 => string 'et' (length=2)
  108 => string 'dit' (length=3)
  114 => string 'par' (length=3)
  118 => string 'Nintendo' (length=8)
  130 => string 'partir' (length=6)
  137 => string 'de' (length=2)
  140 => string 'novembre' (length=8)
  149 => string 'en' (length=2)
  157 => string 'sur' (length=3)
  161 => string 'Super' (length=5)
  167 => string 'Nintendo' (length=8)
  177 => string 'Premier' (length=7)
  185 => string 'jeu' (length=3)
  189 => string 'de' (length=2)
  192 => string 'la' (length=2)
  195 => string 's' (length=1)
  198 => string 'rie' (length=3)

The keys indicate the byte offset of each word in the string.
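The same effect can be reproduced on a short string. This is only a sketch: it assumes the text is UTF-8 and forces the "C" locale, because str_word_count() decides what a letter is based on the current locale. The sample phrase is taken from the dummy text above.

```php
<?php
// Force the "C" locale so that only ASCII letters count as word characters.
setlocale(LC_ALL, 'C');

// "\u{E9}" is "é"; in UTF-8 it takes two bytes, neither of which is an ASCII letter.
$text = "de la s\u{E9}rie Donkey";

// Format 2 returns the words keyed by their byte offset in the string.
$words = str_word_count($text, 2);

// "série" is split into "s" (offset 6) and "rie" (offset 9),
// so four real words are counted as five.
print_r($words);
```

Because the offsets are counted in bytes, any later cut based on them can land in the middle of a word, or even in the middle of a multibyte character.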

So I think that, to get this right, all critical characters in a text would have to be replaced before applying str_word_count(), so that both the number of words and their positions come out right.

I created an issue on GitHub.
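For comparison, here is a sketch of a different approach from the one suggested above: instead of replacing characters and keeping str_word_count(), it splits on whitespace with a UTF-8-aware regular expression. The function name excerptWords() and its parameters are made up for this example and are not part of Kirby. It also treats anything between spaces as a word (so "(DKC," counts as one) and collapses runs of whitespace into single spaces.

```php
<?php
// Hypothetical helper, not Kirby's implementation.
function excerptWords(string $text, int $limit, string $ellipsis = "\u{2026}"): string
{
    // Split on runs of whitespace; the /u modifier makes PCRE treat the input as UTF-8.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (e.g. invalid UTF-8): return the text untouched.
    if ($words === false) {
        return $text;
    }

    // Nothing to cut if the text is already short enough.
    if (count($words) <= $limit) {
        return implode(' ', $words);
    }

    // Keep the first $limit words and append the ellipsis directly after the last one.
    return implode(' ', array_slice($words, 0, $limit)) . $ellipsis;
}

echo excerptWords("Premier jeu de la s\u{E9}rie Donkey Kong Country", 5);
```

Since the text is only ever cut between whole words, diacritics can neither change the count nor shift the cut position.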

I had thought that diacritics might cause problems, but other texts have diacritics and the count is right there, so I dismissed it.
Good find!

If the problem remains, there are alternatives.


I have the same issue.

Hi, I have the same problem using excerpt(words). Any news about this issue?

Did you try the stuff @jenstornell mentioned? My experience with strrpos has been very good so far!

No. I saw there is an issue pending on GitHub about this, so I asked if there is any news.

I will take a look at strrpos.
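For anyone else looking at the strrpos route, this is a rough sketch of how it could work: cut by characters, then step back to the last space so that no word is split. It is an assumption about how strrpos might be used here, not necessarily what was suggested above. The name excerptChars() is made up, and the multibyte variants (mb_substr, mb_strrpos) are used instead of plain strrpos so that accented characters count as one character each; this needs the mbstring extension.

```php
<?php
// Hypothetical helper, not Kirby's implementation. Requires ext-mbstring.
function excerptChars(string $text, int $maxChars, string $ellipsis = "\u{2026}"): string
{
    // Short texts are returned unchanged.
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }

    // Look one character past the limit, so a word that ends exactly at the limit is kept.
    $window = mb_substr($text, 0, $maxChars + 1, 'UTF-8');
    $lastSpace = mb_strrpos($window, ' ', 0, 'UTF-8');

    // Cut at the last space; if there is none, fall back to a hard cut at the limit.
    $cut = $lastSpace === false
        ? mb_substr($text, 0, $maxChars, 'UTF-8')
        : mb_substr($window, 0, $lastSpace, 'UTF-8');

    // Drop trailing whitespace so the ellipsis always follows the last word directly.
    return rtrim($cut) . $ellipsis;
}

echo excerptChars("il met en sc\u{E8}ne les personnages", 17);
```

With a limit of 17 characters the raw cut would fall inside "les", so the excerpt steps back to the space after "scène". Note that the limit is in characters here, not in words.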

Up to this day, I didn’t encounter any problems, but maybe that’s just my luck.