Search: crawl instead of search markdown content files

bvdputte · October 11, 2017, 6:51am

In some websites I create “hidden sections” or “hidden subpages” to manage little pieces of structured content that are not directly accessible, but used in another page. They are usually not directly accessible because they are “too small” to create a dedicated page out of them.

How can I work around the built-in search, which returns the page where the content is “stored” (which is usually hidden/not directly accessible) instead of the page where it is “used”?

How do you deal with this issue? I can imagine that e.g. the modules plugin (which I don’t use) must suffer from this too. Is there some kind of “crawler” around that can be used for this purpose? Or is there a way to figure out in the search controller where the content is used, rather than where it is stored?

I hope this is making sense? If not, let me know, I’ll try to elaborate this.

jimbobrjames · October 11, 2017, 8:18am

Can you not just filter those pages out in the search controller?

Not tested, but I imagine that if you changed this line to include a filter, you could get what you want?

$results = $site->search($query, 'title|text');

I guess you would need something unique on those pages to filter on, or use something like AutoID to feed the filter a string of page ID numbers to exclude.

texnixe · October 11, 2017, 9:06am

If you filter them out, they are not included in the search, so that is not really that useful.

I think you would need a custom method and a relationship field that tells Kirby where that content appears (as Kirby would not be able to find that out by itself without parsing your templates/controllers). Then you could output that link in your search results instead of the page URL.
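To make the relationship idea a bit more concrete, here is a minimal, untested-against-Kirby sketch. It uses plain arrays as stand-ins for page objects, and the field name `partof` as well as the function `resolveResultUrls()` are made up for illustration; in a real site the hits would come from the search call and the lookup from the site's pages.

```php
<?php
// Hypothetical stand-ins: plain arrays instead of Kirby page objects.
// 'partof' is an assumed field holding the id of the page that displays the content.

/**
 * Map search hits to the URLs where their content is actually shown.
 *
 * @param array $hits      pages matched by the search (each: id, url, optional partof)
 * @param array $pagesById all pages, keyed by id
 * @return string[] unique URLs, in order of first appearance
 */
function resolveResultUrls(array $hits, array $pagesById): array
{
    $urls = [];
    foreach ($hits as $hit) {
        $target = $hit;
        // If the hit is a hidden piece, swap it for its container page.
        if (!empty($hit['partof']) && isset($pagesById[$hit['partof']])) {
            $target = $pagesById[$hit['partof']];
        }
        // Keying by URL removes duplicates when several pieces share a container.
        $urls[$target['url']] = true;
    }
    return array_keys($urls);
}

$pages = [
    'page-d'        => ['id' => 'page-d', 'url' => '/page-d'],
    'page-d/part-a' => ['id' => 'page-d/part-a', 'url' => '/page-d/part-a', 'partof' => 'page-d'],
    'page-d/part-b' => ['id' => 'page-d/part-b', 'url' => '/page-d/part-b', 'partof' => 'page-d'],
    'about'         => ['id' => 'about', 'url' => '/about'],
];

// Two hidden pieces and one normal page matched the query.
$hits = [$pages['page-d/part-a'], $pages['page-d/part-b'], $pages['about']];
print_r(resolveResultUrls($hits, $pages));
```

Both hidden pieces collapse into a single result for their container, while pages without a `partof` value keep their own URL.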

I am not sure how third party extensions like the Algolia plugin work, maybe that could be an alternative?

jimbobrjames · October 11, 2017, 9:39am

But that is what @bvdputte wants. These are not whole pages, but utility pages used as part of other pages, and not meant to be viewed directly, and should not appear in the search.

texnixe · October 11, 2017, 10:01am

I answer with this quote:

So for me that means, @bvdputte does want the content to appear in the search results, but not with their own (inaccessible) URL, but with the URL of the page where the content appears. Correct me if I’m wrong, @bvdputte.

bvdputte · October 11, 2017, 10:03am

You are 100% right. This is what I would like to achieve.

jimbobrjames · October 11, 2017, 10:07am

Ok so, help me understand how the search works then…

Consider that Page D is constructed from data contained in fields on pages A, B and C. If you filter out pages A, B and C, does that mean that Kirby will not see this data as belonging to Page D, and it will not be searchable? Page D will essentially be empty as far as the search is concerned?

bvdputte · October 11, 2017, 10:11am

Exactly, as you’d see in the page-d.en.md file in the /content folder.
And this is where the Kirby search function looks for matches.

The logic that aggregates those 3 files into 1 lives in the template, and Kirby does not know of this relationship (as @texnixe beautifully said) outside that template.

jimbobrjames · October 11, 2017, 10:14am

I see. So maybe this plugin will help then? You essentially want to redirect direct hits to pages A, B and C to page D. Would that work for you?

You do have another problem - getting Google and other search engines to see the “real” page in search results. You will probably have to get clever with the site map.

texnixe · October 11, 2017, 10:21am

Why? Google crawls URLs and their content, I don’t see how that would affect search results.

jimbobrjames · October 11, 2017, 10:23am

Because they will index pages A, B and C, when all that is needed is Page D.

bvdputte · October 11, 2017, 10:23am

Exactly. Google indexes URLs. And this is what makes it ambiguous for users of the website who use the “on-site search” → this yields different results than Google. Which is (technically) obvious for the above-mentioned reasons, but not wanted/expected behaviour.

texnixe · October 11, 2017, 10:23am

No. If there are no links to pages A, B and C because they are not accessible, there is no problem.

texnixe · October 11, 2017, 10:27am

That’s true. The built-in search method is just useful in very simple cases, where basically a folder is a page that is accessible via its URL. It is a page/site method after all, and pages are tied to folders.

That’s why you have to make sure that those folders do not appear in URLs (using routes, for example) and that they don’t appear in the search results as pages, but at the same time make them appear in search results for those aggregated pages (using relationships or some other search method).

jimbobrjames · October 11, 2017, 10:27am

I have had pages like this show up in search results accidentally because they somehow found their way into the sitemap.xml file. It took some time to flush them out of Google’s index.

bvdputte · October 12, 2017, 6:19pm

So, now that we have identified the symptoms, are there people with a solution/workaround?
Tbh I’m a bit surprised that no one has ever experienced this?

texnixe · October 12, 2017, 10:33pm

Just some ideas:

https://codecanyon.net/item/easy-web-search-php-search-engine-with-image-search-and-crawling-system/17574164
Use curl and feed it your (cleaned-up) sitemap (probably not very performant unless cached or indexed in a database). A really basic example, just as a starting point:

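<?php

  // Fetch the rendered HTML of a URL; returns false if the request fails.
  function getData($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
  }

  $q = 'Kirby';
  $matches = [];
  // $site->index() returns every page, hidden ones included;
  // in practice, loop over the URLs of the cleaned-up sitemap instead.
  foreach($site->index() as $p) {
    $returned_content = getData($p->url());
    // Skip failed requests, then match case-insensitively.
    if ($returned_content !== false && stripos($returned_content, $q) !== false) {
        $matches[] = $p->url();
    }
  }
  dump($matches);

?>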

Maybe even read the cache files, provided the cache is rebuilt after every flush.
Or, as I suggested above, a custom search with relationships.
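For the curl idea, the "cached or indexed" part could look roughly like the sketch below: fetch each URL once, reduce the HTML to plain text, and search that text instead of the raw markup. This is only an illustration; `extractText()`, `buildIndex()`, `searchIndex()` and the fake `$fetch` callback are names made up here, and in practice `$fetch` would wrap a curl request like the one above.

```php
<?php
/**
 * Reduce rendered HTML to plain, searchable text.
 */
function extractText(string $html): string
{
    // Drop script and style blocks first, since strip_tags() keeps their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before every tag so adjacent blocks do not get glued together.
    $html = str_replace('<', ' <', $html);
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', $text));
}

/**
 * Fetch every URL once and build a url => text index.
 *
 * @param string[] $urls
 * @param callable $fetch function(string $url): string|false
 */
function buildIndex(array $urls, callable $fetch): array
{
    $index = [];
    foreach ($urls as $url) {
        $html = $fetch($url);
        if ($html !== false) {
            $index[$url] = extractText($html);
        }
    }
    return $index;
}

/** Return the URLs whose text contains the query (case-insensitive). */
function searchIndex(array $index, string $query): array
{
    return array_keys(array_filter($index, function ($text) use ($query) {
        return stripos($text, $query) !== false;
    }));
}

// Fake fetcher so the sketch runs offline (assumption: real code would use curl).
$fakeSite = [
    '/page-d' => '<html><body><h1>Page D</h1><p>Built with <b>Kirby</b></p><script>var x = 1;</script></body></html>',
    '/about'  => '<html><body><p>About us</p></body></html>',
];
$fetch = function (string $url) use ($fakeSite) {
    return $fakeSite[$url] ?? false;
};

$index = buildIndex(array_keys($fakeSite), $fetch);
print_r(searchIndex($index, 'kirby'));
```

Because the index is built from the rendered pages, content that lives in hidden folders is found under the URL of the page that actually displays it.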

bvdputte · October 13, 2017, 6:04am

Relationships are quite another beast to handle, imho. I know that e.g. ola has developed a relationship plugin, but, as I expressed in my concern, it doesn’t handle “keep established relationships working” well (e.g. in case of URL renames or page deletions). You’re on your own in accounting for this, via hooks and other plugins. This makes the custom development dependency graph quite huge for a little use case; imho it feels like building a house of cards.

Thanks for your insights in this, @texnixe.

When I find some time, I might have a go at your other proposal about building a “crawler” ourselves and then searching the cache it builds. Performance will be a pickle, I’m afraid. Maybe cron can help with that? Does anyone know of a good, simple, open-source PHP crawler that we could plug in for this? I would like to avoid re-inventing the wheel.
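A rough sketch of how the cron part might work, assuming a scheduled script writes the crawled index to a JSON file and the search only reads that file. The function names `saveIndex()` and `loadIndex()`, the file location and the one-hour maximum age are all assumptions for illustration.

```php
<?php
/**
 * Write the index via a temporary file, so a search never reads a half-written file.
 */
function saveIndex(string $file, array $index): void
{
    $payload = json_encode(['built' => time(), 'pages' => $index]);
    $tmp = $file . '.tmp';
    file_put_contents($tmp, $payload, LOCK_EX);
    rename($tmp, $file);
}

/**
 * Load the index, or return null if it is missing, unreadable
 * or older than $maxAge seconds.
 */
function loadIndex(string $file, int $maxAge): ?array
{
    if (!is_file($file)) {
        return null;
    }
    $data = json_decode(file_get_contents($file), true);
    if (!is_array($data) || !isset($data['built'], $data['pages'])) {
        return null;
    }
    if (time() - $data['built'] > $maxAge) {
        return null;
    }
    return $data['pages'];
}

// Cron side: rebuild and store the index (the content here is dummy data).
$file = sys_get_temp_dir() . '/search-index-demo.json';
saveIndex($file, ['/page-d' => 'Page D Built with Kirby']);

// Search side: only read the stored index, never crawl during a request.
$cached = loadIndex($file, 3600);
var_dump($cached !== null && stripos($cached['/page-d'], 'kirby') !== false);
```

When `loadIndex()` returns null, the search could fall back to the built-in search rather than crawling during the request.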

texnixe · October 13, 2017, 7:57am

Well, maybe someone comes up with some better ideas.

I see the problem with relationships, but that is something you will have to take care of no matter which of the methods you use. With a solution like Algolia you have to keep your index intact, with a database index as well, although maybe easier to achieve.

I don’t know your exact use case, but I would think that if you tell a non-accessible module page that it is part of container X (rather than telling X what modules belong to it) and if X is less likely to change its URL or be deleted, then this relationship thingy could work.
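If the module points at its container, a small check could at least report relationships that have gone stale after a rename or deletion. A minimal sketch with plain arrays; the field name `partof` and the function `findOrphans()` are hypothetical, and in a real setup something like this might run from a hook or a maintenance script.

```php
<?php
/**
 * List the pieces whose container no longer exists.
 *
 * @param array $pagesById all pages keyed by id; a piece names its container in 'partof'
 * @return string[] ids of pieces with a dangling 'partof'
 */
function findOrphans(array $pagesById): array
{
    $orphans = [];
    foreach ($pagesById as $id => $page) {
        if (isset($page['partof']) && !isset($pagesById[$page['partof']])) {
            $orphans[] = $id;
        }
    }
    return $orphans;
}

$allPages = [
    'page-d'        => [],
    'modules/intro' => ['partof' => 'page-d'],
    'modules/team'  => ['partof' => 'old-about'], // container was renamed or deleted
];
print_r(findOrphans($allPages));
```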

But maybe the crawling solutions are more reliable after all. Anyway, I was just trying to come up with some ideas, seeing that up to now nobody has come up with a ready-to-use solution. I have never tested any of these approaches myself.

Anyway, interesting topic. Let me finish this with this list of crawlers: https://github.com/BruceDone/awesome-crawler

pedroborges · October 13, 2017, 9:12am

So for me that means, @bvdputte does want the content to appear in the search results, but not with their own (inaccessible) URL, but with the URL of the page where the content appears.

You are 100% right. This is what I would like to achieve.

I’m not sure if this solution would work in your search context, but in a different situation I have used Kirby models to change the URL of related pages that shouldn’t exist on their own.

<?php // site/models/{related-page-template-name}.php

class Relatedpage extends Page
{
    // Report the parent's URL, so that links to this page point to the parent instead.
    public function url()
    {
        return $this->parent()->url();
    }
}
