Search: crawl instead of search markdown content files


#1

In some websites I create “hidden sections” or “hidden subpages” to manage little pieces of structured content that are not directly accessible, but used in another page. They are usually not directly accessible because they are “too small” to create a dedicated page out of them.

How can I work around the built-in search, which returns the page where the content is “stored” (which is usually hidden/not directly accessible) instead of the page where it is “used”?

How do you deal with this issue? I can imagine that e.g. the modules plugin (which I don’t use) must suffer from this too. Is there some kind of “crawler” around that can be used for this purpose? Or is there a way to figure out in the search controller where the content is used, instead of where it is stored?

I hope this makes sense. If not, let me know and I’ll try to elaborate.


#2

Can you not just filter those pages out in the search controller?

Not tested, but I imagine that if you changed this line to include a filter, you could get what you want?

$results = $site->search($query, 'title|text');

I guess you would need something unique on those pages to filter on, or use something like AutoID to feed the filter a string of page ID numbers to exclude.


#3

If you filter them out, they are not included in the search, so that is not really that useful.

I think you would need a custom method and a relationship field that tells Kirby where that content appears (as Kirby would not be able to find that out by itself without parsing your templates/controllers). Then you could output that link in your search results instead of the page URL.
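To make the relationship idea concrete, here is a minimal sketch in plain PHP, not Kirby's own API. It assumes the search hands back page IDs, and that a lookup table (`$usedOn`, which could be filled from a relationship field) says on which URL each hidden page is shown. The function name `resolveResultUrls` and all IDs and URLs are made up for illustration.

```php
<?php
/**
 * Replace hidden-page hits with the URL of the page where the content appears.
 *
 * @param string[] $hits    page IDs returned by the search (assumed input)
 * @param array    $urls    page ID => own URL, for directly accessible pages
 * @param array    $usedOn  hidden page ID => URL of the page that displays it
 * @return string[] unique URLs, in order of first appearance
 */
function resolveResultUrls(array $hits, array $urls, array $usedOn): array
{
    $resolved = [];
    foreach ($hits as $id) {
        if (isset($usedOn[$id])) {
            // Hidden page: link to the page that uses its content
            $resolved[] = $usedOn[$id];
        } elseif (isset($urls[$id])) {
            // Regular page: link to its own URL
            $resolved[] = $urls[$id];
        }
        // Hits with neither a relationship nor a public URL are dropped
    }
    // Several hidden pages may point to the same container, so deduplicate
    return array_values(array_unique($resolved));
}

// Hypothetical data for illustration only
$urls    = ['team' => '/team', 'about' => '/about'];
$usedOn  = ['member-1' => '/team', 'member-2' => '/team'];
$results = resolveResultUrls(['member-1', 'about', 'member-2', 'orphan'], $urls, $usedOn);
print_r($results);
```

The lookup still has to be kept up to date when pages are renamed or deleted; the sketch does not address that.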

I am not sure how third party extensions like the Algolia plugin work, maybe that could be an alternative?


#4

But that is what @bvdputte wants. These are not whole pages, but utility pages used as part of other pages, and not meant to be viewed directly, and should not appear in the search.


#5

I answer with this quote:

So for me that means, @bvdputte does want the content to appear in the search results, but not with their own (inaccessible) URL, but with the URL of the page where the content appears. Correct me if I’m wrong, @bvdputte.


#6

You are 100% right. This is what I would like to achieve.


#7

Ok so, help me understand how the search works then…

Consider that Page D is constructed from data contained in fields on pages A, B and C. If you filter out pages A, B and C, does that mean that Kirby will not see this data as belonging to Page D, and it will not be searchable? Page D will essentially be empty as far as the search is concerned?


#8

Exactly. As you’d see in the page-d.en.md file in the /content folder :wink:
And this is where the Kirby search function looks for matches.

The logic that aggregates those 3 files into 1 lives in the template, and Kirby does not know of this relationship (as @texnixe beautifully said) outside that template.


#9

I see. So maybe this plugin will help then? You essentially want to redirect direct hits to pages A, B and C to page D. Would that work for you?

You do have another problem - getting Google and other search engines to see the “real” page in search results. You will probably have to get clever with the site map.


#10

Why? Google crawls URLs and their content, I don’t see how that would affect search results.


#11

Because they will index pages A, B and C, when all that is needed is Page D.


#12

Exactly. Google indexes URLs. And this is what makes it ambiguous for users of the website who use the “on-site search”: it yields different results than Google. That is (technically) obvious for the reasons mentioned above, but it is not the wanted/expected behaviour.


#13

No, if there is no link to those A, B, C pages, because they are not accessible, no problem.


#14

That’s true. The built-in search method is just useful in very simple cases, where basically a folder is a page that is accessible via its URL. It is a page/site method after all, and pages are tied to folders.

That’s why you have to make sure that those folders do not appear in URLs (using routes, for example) and that they don’t appear in the search results as pages, but at the same time make them appear in search results for those aggregated pages (using relationships or some other search method).


#15

I have had pages like this show up in search results accidentally because they somehow found their way into the sitemap.xml file. It took some time to flush them out of Google’s index.


#16

So, now we have identified the symptoms, are there people with a solution/workaround?
Tbh I’m a bit surprised no one has ever experienced this? :eyes:


#17

Just some ideas:

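<?php

  // Fetch the rendered HTML of a URL; returns false if the request fails
  function getData($url) {
  	$ch = curl_init();
  	$timeout = 5;
  	curl_setopt($ch, CURLOPT_URL, $url);
  	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
  	$data = curl_exec($ch);
  	curl_close($ch);
  	return $data;
  }

  $q = 'Kirby';
  $matches = [];
  // Request every page and search its rendered output instead of its content file
  foreach($site->index() as $p) {
    $returned_content = getData($p->url());
    // Skip failed requests, then look for the query in the rendered HTML
    if ($returned_content !== false && strpos($returned_content, $q) !== false) {
        $matches[] = $p->url();
    }
  }
  dump($matches);

?>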
  • maybe even read the cache files, provided the cache is rebuilt after every flush.

  • or, as I suggested above, custom search with relationships


#18

Relationships are quite another beast to handle imho. I know e.g. ola has developed a relationship plugin, but as I expressed in my concern, it doesn’t handle “keep established relationships working” well (e.g. in case of URL renames or page deletions)? You’re on your own to account for this, via hooks and other plugins :slight_smile:. This makes the custom development dependency graph quite huge for a little use case; imho it feels like building a house of cards.

Thanks for your insights in this, @texnixe.

When I find some time, I might have a go at your other proposal: building a “crawler” ourselves and then searching the cache it builds. Performance will be a pickle, I’m afraid. Maybe cron can help with that? Does anyone know of a good, simple, open source PHP crawler that we could plug in for this? I would like to avoid reinventing the wheel.
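As a rough sketch of that crawl-then-search idea (untested against a real Kirby site): one step renders every page and stores its visible text in a cache file, and the search then only reads that file. The function names `buildIndex` and `searchIndex`, the cache file name and the fake site data are all assumptions for illustration. The fetch step is passed in as a callable so the sketch runs without a server; in practice it could be a cURL request.

```php
<?php
/**
 * Build a search index of rendered pages: URL => visible text.
 * $fetch is any callable returning the HTML for a URL, or false on failure.
 */
function buildIndex(array $urls, callable $fetch): array
{
    $index = [];
    foreach ($urls as $url) {
        $html = $fetch($url);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        // Add a space before each tag so words from adjacent elements stay apart
        $text = html_entity_decode(strip_tags(str_replace('<', ' <', $html)));
        // Collapse runs of whitespace into single spaces
        $index[$url] = trim(preg_replace('/\s+/', ' ', $text));
    }
    return $index;
}

/** Return the URLs whose indexed text contains $query (case-insensitive). */
function searchIndex(array $index, string $query): array
{
    $matches = [];
    foreach ($index as $url => $text) {
        if ($query !== '' && stripos($text, $query) !== false) {
            $matches[] = $url;
        }
    }
    return $matches;
}

// Stand-in for real HTTP requests (hypothetical pages)
$fakeSite = [
    '/team'    => '<h1>Team</h1><p>Built with <b>Kirby</b></p>',
    '/contact' => '<h1>Contact</h1>',
];
$fetch = function (string $url) use ($fakeSite) {
    return $fakeSite[$url] ?? false;
};

// A cron job could run this part and write the index to disk
$cacheFile = sys_get_temp_dir() . '/search-index.json';
file_put_contents($cacheFile, json_encode(buildIndex(array_keys($fakeSite), $fetch)));

// The search controller would then only read the cached index
$index = json_decode(file_get_contents($cacheFile), true);
$found = searchIndex($index, 'kirby');
print_r($found);
```

Because the index is built from rendered output, content pulled in from hidden pages is found under the URL where it is displayed. The price is that the index is only as fresh as the last crawl.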


#19

Well, maybe someone comes up with some better ideas.

I see the problem with relationships, but that is something you will have to take care of no matter which of the methods you use. With a solution like Algolia you have to keep your index intact, with a database index as well, although maybe easier to achieve.

I don’t know your exact use case, but I would think that if you tell a non-accessible module page that it is part of container X (rather than telling X what modules belong to it) and if X is less likely to change its URL or be deleted, then this relationship thingy could work.

But maybe the crawling solutions are more reliable after all. Anyway, I was just trying to come up with some ideas, seeing that up to now nobody has come up with a ready-to-use solution. I have never tested any of these approaches myself.

Anyway, interesting topic. Let me finish this with this list of crawlers: https://github.com/BruceDone/awesome-crawler


#20

So for me that means, @bvdputte does want the content to appear in the search results, but not with their own (inaccessible) URL, but with the URL of the page where the content appears.

You are 100% right. This is what I would like to achieve.

I’m not sure if this solution would work in your search context, but in a different situation I have used Kirby page models to change the URL of related pages that shouldn’t exist on their own.

<?php // site/models/{related-page-template-name}.php

class Relatedpage extends Page
{
      public function url()
      {
          return $this->parent()->url();
      }
}