URL validator without "https://"

Hi all, just wondering if it’s possible to validate a URL without having to include the https:// part? I’m currently using the URL validator but this doesn’t appear to be possible.

Thanks!

A web address without the scheme component (e.g. https) is not a valid “URL” in the strict technical sense, hence the validator returns false. The easiest way is to prepend the missing scheme as you call the validator:

V::url('https://' . $string)

Or, if you want to get true regardless of whether $string contains the https:// part or not, you can add the scheme conditionally when it’s not part of the string:

V::url((empty(parse_url($string)['scheme']) ? 'https://' : '') . $string)
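
If you need this in several places, one option (just a quick sketch; the helper name isLikelyUrl() is made up, not a Kirby function) is to wrap the same logic into a small helper:

// minimal sketch of a reusable wrapper around V::url()
function isLikelyUrl(string $string): bool
{
  $scheme = parse_url($string, PHP_URL_SCHEME);
  $url    = (empty($scheme) ? 'https://' : '') . $string;
  return V::url($url);
}

isLikelyUrl('getkirby.com');         // true
isLikelyUrl('https://getkirby.com'); // true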

Thanks @sebastiangreger, I implemented your suggestion and it seemed to work well, but I ended up not using validation on the field as I found the URL validation wasn’t completely reliable.

Indeed, finding a regular expression that reliably detects any variation a valid URL may take is something like the holy grail of string parsing, with endless sources debating over the best approach :roll_eyes: (e.g. the Kirby source code refers to this impressive collection of various solutions; Kirby uses a parser very close to that by diegoperini which meets all but one test case in that table).

Just out of curiosity: are you able to recall what format of URL in particular slipped through the validator in your tests?

Interesting, thanks for the info @sebastiangreger.

I can’t remember for certain, but I believe I found that it was validating anything with a full stop in it. So it could just be something like “sdfsdf.dfgfg”. I guess if it needs to accept any domain with or without “www” then only one full stop is correct, but I thought it would compare against valid domains.

Yes, the V::url validator only checks that a string conforms with the syntax rules for URLs, not that the URL actually exists or resolves to something meaningful. If you need to validate that a string resolves to an actual website, there are ways – though none of them is 100% perfect:

Validate that a DNS entry exists for the given domain

$url = (empty(parse_url($string)['scheme']) ? 'https://' : '') . $string;
if (V::url($url)) {
  $domain = parse_url($url)['host'];
  // gethostbyname() returns the hostname unchanged on failure,
  // so a valid IP address means the DNS lookup succeeded
  if (filter_var(gethostbyname($domain), FILTER_VALIDATE_IP)) {
    // domain resolves to a DNS entry
  } else {
    // no IP address can be found for this domain
  }
}

This validates that the provided domain exists, but not that the full URL really leads somewhere meaningful. A DNS lookup may take several seconds, and you would get false negatives for domains that are registered but not assigned to a server at that moment. Still, I’d consider this the most performant, considerate and safe of the options here.
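
As a side note (my own addition, not something from the snippet above): gethostbyname() only looks up IPv4 A records, so a host that is only reachable via IPv6 would come back as a false negative. If that matters, checkdnsrr() lets you test for other record types as well – roughly like this:

$url = (empty(parse_url($string)['scheme']) ? 'https://' : '') . $string;
if (V::url($url)) {
  $domain = parse_url($url)['host'];
  // accept hosts that resolve via IPv4 (A) or IPv6 (AAAA)
  if (checkdnsrr($domain, 'A') || checkdnsrr($domain, 'AAAA')) {
    // a DNS record exists for this domain
  } else {
    // no A/AAAA record can be found for this domain
  }
}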

Fetch the headers returned by the URL

$url = (empty(parse_url($string)['scheme']) ? 'https://' : '') . $string;
if (V::url($url)) {
  $headers = get_headers($url);
  // get_headers() returns false if the request fails entirely;
  // otherwise $headers[0] contains the HTTP status line, e.g. "HTTP/1.1 200 OK"
}

This actually makes a request to the given URL, so you know what HTTP response code it would return. While it’s already the slower option, it also has limitations: most importantly, servers may reject such “bot” requests even though the URL would actually resolve in a browser. I’d only do this if it’s absolutely crucial to validate that a full URL resolves; you’ll get fewer false negatives with the first approach. Also be careful not to inadvertently provide your users with a DDoS bot :wink:
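
If you do go down this path, here is a rough sketch of how to soften both caveats a little (the timeout value and User-Agent string are arbitrary assumptions on my part): a stream context sets a short timeout and a browser-like User-Agent, and a HEAD request avoids downloading the whole response body:

$url = (empty(parse_url($string)['scheme']) ? 'https://' : '') . $string;
if (V::url($url)) {
  // short timeout and a browser-like User-Agent to reduce the waiting time
  // and the chance of being rejected as a bot (values chosen arbitrarily)
  $context = stream_context_create([
    'http' => [
      'method'     => 'HEAD', // switch to 'GET' for servers that reject HEAD
      'timeout'    => 3,
      'user_agent' => 'Mozilla/5.0 (compatible; URLCheck/1.0)',
    ],
  ]);
  $headers = get_headers($url, 0, $context);
  if ($headers !== false && preg_match('/\s[23]\d\d(\s|$)/', $headers[0])) {
    // the URL answered with a 2xx or 3xx status
  } else {
    // request failed or returned an error status
  }
}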

Thanks so much @sebastiangreger. I very much appreciate the advice.

I assumed a collection of possible TLDs (.com, .info etc.) could be searched to see if the domain entered was at least possible. Making sure it actually exists maybe wouldn’t be necessary if it requires a DNS lookup. As there is only a limited number of TLDs available, perhaps this wouldn’t be a bad approach? In my opinion, the ideal solution would be if the validation checked that the syntax of the URL was correct and that the part after the last full stop was a possible TLD.

That would still require you to either check against some API that returns all of the currently over 1,500 TLDs (as of April 2021), or to create an array of all possible TLDs and use that as a reference. That list would have to be kept up to date, though.

You can find a list here: https://data.iana.org/TLD/tlds-alpha-by-domain.txt

Thanks @texnixe. Yeah, I think an array of possible TLDs would be good. It’s not like it has to be 100% accurate. That’s just my opinion though, @sebastiangreger what do you think?

I’m not so sure whether a validator that’s “not 100% accurate” is really that useful, quite frankly. The only gain you’d have (indeed at the expense of having to keep that TLD list up-to-date) would be that now “sdfsdf.dfgfg” won’t be true any longer, but “sdfsdf.name” still would.

The correct validation strategy really depends on why you need to validate, but if it’s just about reducing the probability of a provided URL being fake, the effort might not be worth it. I haven’t tried the DNS validation (so I can’t say how performant it really is), but to me it seems that would be a better compromise than checking the TLD, if you don’t want to go down the full-blown get_headers path to actually check what’s behind a URL.

That said, it would of course be rather easy to import and keep a local copy of that IANA list as an array and extend the above sample code to run a match against end(explode(".", parse_url($url, PHP_URL_HOST)))

… and probably also test punycode and utf8 versions of the domain, since both should be valid but the list above only contains the IDN variant.
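
To sketch that out (purely an illustration of the ideas above; the file name and the filtering are my assumptions, not code from Kirby): keep a local copy of the IANA list, normalize the host to its punycode form with idn_to_ascii() from the intl extension, and compare its last label against the list:

$url  = (empty(parse_url($string)['scheme']) ? 'https://' : '') . $string;
$host = parse_url($url, PHP_URL_HOST);
if (V::url($url) && $host) {
  // normalize e.g. "bücher.example" to its punycode form "xn--bcher-kva.example"
  $asciiHost = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46) ?: $host;
  // take the last label of the host name, e.g. "com" or "xn--p1ai"
  $labels = explode('.', $asciiHost);
  $tld    = strtolower(end($labels));
  // local copy of https://data.iana.org/TLD/tlds-alpha-by-domain.txt
  // (uppercase TLDs, one per line, plus a comment line starting with "#")
  $lines = file('tlds-alpha-by-domain.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
  $tlds  = array_map('strtolower', array_filter($lines, function ($line) {
    return strpos($line, '#') !== 0;
  }));
  if (in_array($tld, $tlds, true)) {
    // TLD exists on the IANA list
  } else {
    // unknown TLD
  }
}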

100%!