Indy, TIdURI.PathEncode, URLEncode and ParamsEncode and more

Frequently in Delphi we come across the need to encode a string to stuff into a URL query string parameter (as per web forms).  One would expect that Indy contains well-tested functions to handle this.  Well, Indy contains some functions to help with this, but they may not work quite as you expect.  In fact, they may not be much use at all.

Indy contains a class called TIdURI.  It provides, among other things, the class functions URLEncode, PathEncode, and ParamsEncode. At first glance, these seem to do what you would need.  But in fact, they don’t.

URLEncode will take a full URL, split it into path, document and query components, encode each of those, and return the full string.  PathEncode is intended to handle the nuances of the path and document components of the URL, and ParamsEncode handles query strings.

Sounds great, right?  Well, it works until you have a query parameter that has an ampersand (&) in it.  Say my beloved end user wants to search for big&little.  It seems that you could pass the following in:

s := TIdURI.URLEncode('http://www.google.com/search?q='+SearchText);

But then we get no change in our result:

s = 'http://www.google.com/search?q=big&little';

And you can already see the problem: little is now a separate parameter in the query string.  How can we work around this?  Can we pre-encode the ampersand to %26 before we pass in the parameters?

s := TIdURI.URLEncode('http://www.google.com/search?q='+ReplaceStr(SearchText, '&', '%26'));

No, because the % we just added is itself encoded to %25:

s = 'http://www.google.com/search?q=big%2526little';

And obviously we can’t fix it up ourselves afterwards, because by then we can’t tell which ampersands are separators and which are data.  You could encode each parameter name and value separately with ParamsEncode, and then post-process each one for ampersands and other characters before final assembly. But you’ll soon find that it’s not enough anyway:  =, / and ? are also left unencoded, although inside a parameter name or value they should be encoded.  Finally, URLEncode does not support internationalized domain names (IDN).

Given that these functions are not a complete solution, it’s probably best to avoid them altogether.

The problem is analogous to the JavaScript encodeURI vs encodeURIComponent issue: encodeURI leaves reserved characters such as & untouched, while encodeURIComponent encodes them.

So to write your own…  I haven’t found a good Delphi solution online (and I searched a bit), so here’s a function I’ve cobbled together (use at your own risk!) to encode parameter names and values. You do need to encode each component of the parameter string separately, of course.

function EncodeURIComponent(const ASrc: string): UTF8String;
const
  HexMap: UTF8String = '0123456789ABCDEF';

  function IsSafeChar(ch: Integer): Boolean;
  begin
    if (ch >= 48) and (ch <= 57) then Result := True    // 0-9
    else if (ch >= 65) and (ch <= 90) then Result := True  // A-Z
    else if (ch >= 97) and (ch <= 122) then Result := True  // a-z
    else if (ch = 33) then Result := True // !
    else if (ch >= 39) and (ch <= 42) then Result := True // '()*
    else if (ch >= 45) and (ch <= 46) then Result := True // -.
    else if (ch = 95) then Result := True // _
    else if (ch = 126) then Result := True // ~
    else Result := False;
  end;
var
  I, J: Integer;
  ASrcUTF8: UTF8String;
begin
  Result := '';    {Do not Localize}

  ASrcUTF8 := UTF8Encode(ASrc);
  // UTF8Encode call not strictly necessary but
  // prevents implicit conversion warning

  I := 1; J := 1;
  SetLength(Result, Length(ASrcUTF8) * 3); // space to %xx encode every byte
  while I <= Length(ASrcUTF8) do
  begin
    if IsSafeChar(Ord(ASrcUTF8[I])) then
    begin
      Result[J] := ASrcUTF8[I];
      Inc(J);
    end
    else if ASrcUTF8[I] = ' ' then
    begin
      Result[J] := '+';  // form-style encoding: space becomes '+', not %20
      Inc(J);
    end
    else
    begin
      Result[J] := '%';
      Result[J+1] := HexMap[(Ord(ASrcUTF8[I]) shr 4) + 1];
      Result[J+2] := HexMap[(Ord(ASrcUTF8[I]) and 15) + 1];
      Inc(J,3);
    end;
    Inc(I);
  end;

  SetLength(Result, J-1);
end;
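
To make the behaviour concrete, here are a few illustrative checks. This is only a sketch: CheckEncodeURIComponent is a hypothetical name, it assumes the EncodeURIComponent function above is in scope, and it assumes assertions are enabled ({$ASSERTIONS ON}). The expected strings follow from the safe-character list and the space-to-plus rule in the function.

```pascal
// Illustrative checks only; CheckEncodeURIComponent is a hypothetical helper.
// Assumes EncodeURIComponent (above) is in scope and {$ASSERTIONS ON}.
procedure CheckEncodeURIComponent;
begin
  Assert(EncodeURIComponent('big&little') = 'big%26little'); // '&' is escaped
  Assert(EncodeURIComponent('a=b/c?d') = 'a%3Db%2Fc%3Fd');   // so are = / ?
  Assert(EncodeURIComponent('two words') = 'two+words');     // space becomes '+'
  Assert(EncodeURIComponent('50%') = '50%25');               // literal '%'
  Assert(EncodeURIComponent('caf'#$00E9) = 'caf%C3%A9');     // UTF-8 bytes of U+00E9
end;
```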

To use this, do something like the following:

function GetAURL(const param, value: string): UTF8String;
begin
  Result := 'http://www.example.com/search?'+
    EncodeURIComponent(param)+
    '='+
    EncodeURIComponent(value);
end;
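
If you have more than one parameter, the same idea extends naturally: encode every name and value on its own, and only then join the pairs with unencoded & separators. The following is a minimal sketch under that assumption. BuildQuery is a hypothetical helper, not part of Indy or the RTL, and it relies on the EncodeURIComponent function above being in scope.

```pascal
// Hypothetical helper: joins name/value pairs into a query string.
// Assumes EncodeURIComponent (defined above) is in scope.
function BuildQuery(const Names, Values: array of string): UTF8String;
var
  I: Integer;
begin
  Result := '';
  // Open arrays are zero-based; stop if a name has no matching value.
  for I := 0 to High(Names) do
  begin
    if I > High(Values) then
      Break;
    if I > 0 then
      Result := Result + '&';  // unencoded '&' separates the pairs
    Result := Result + EncodeURIComponent(Names[I]) + '=' +
      EncodeURIComponent(Values[I]);
  end;
end;
```

For example, BuildQuery(['q', 'lang'], ['big&little', 'en']) should give q=big%26little&lang=en, which can then be appended after the ? of the URL.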

Hope this helps. Sorry, I haven't got an IDN solution in this post!

6 thoughts on “Indy, TIdURI.PathEncode, URLEncode and ParamsEncode and more”

  1. Yeah, the return value should probably be RawByteString or AnsiString(1252) for clarity, not UTF8String. It won’t change anything in this case, as the function is guaranteed to return only characters between $21 and $7E!

  2. In Delphi XE5’s (don’t know how far back this goes) REST.Utils unit:

    function URIEncode(const S: string): string;

    Looks like it does everything properly. It’s used by the REST.Client unit’s TCustomRESTRequest.

    1. Thanks James – it’s certainly not in XE2 but will investigate when I get a chance (have XE5, just have not yet had the time to move to it).

  3. Yes, TIdURI has quite a few known limitations.

    Note that TIdHTTP methods expect a fully encoded URL as input, so the ReplaceStr() approach would “work” if you skip TIdURI altogether:

    TIdHTTP.Get('http://www.google.com/search?q='+ReplaceStr(SearchText, '&', '%26'));

    Obviously you will have issues if SearchText contains other characters that also need to be percent-encoded.

    The way TIdURI is *intended* to be used in this example is more like this:

    TIdURI.URLEncode('http://www.google.com/search?q='+SearchText)

    Or at least this:

    'http://www.google.com/search?'+TIdURI.ParamsEncode('q='+SearchText)

    However, as you noted, TIdURI.ParamsEncode() (and TIdURI.PathEncode()) does not percent-encode ‘&’ characters, but it does percent-encode ‘%’ characters.

    ‘%’ by itself is a reserved character that must be percent-encoded, however RFC 3986 (and 3987) allows ‘%’ to be unencoded when it is used in a percent-encoded octet sequence. TIdURI.ParamsEncode() (and TIdURI.PathEncode()) does not currently account for that rule. That is a bug that should be fixed (https://github.com/IndySockets/Indy/issues/176).

    ‘&’, on the other hand, is not being percent-encoded because RFC 3986 (and 3987) specifically allows unencoded ‘&’ in the path and query components. URIs/IRIs in general don’t know anything about “name=value” pairs in the query component, let alone that they are separated by unencoded ‘&’. That convention is from the HTML standard in the “application/x-www-form-urlencoded” media type, which separates “name=value” pairs by an unencoded ‘&’ and then percent-encodes all non-alphanumeric characters including ‘&’ in the “name” and “value” subcomponents. Encoded webform data is compatible as-is with a URL query string when submitting a webform using an HTTP GET instead of a POST. Neither the URI/IRI standards nor the HTTP protocol definition of HTTP URIs require percent-encoded ‘&’ in the query component.

    When TIdURI.ParamsEncode() is fixed to recognize pre-existing ‘%HH’ octets, the following will then work correctly as expected:

    'http://www.google.com/search?'+TIdURI.ParamsEncode('q=big%26little')

    1. Thanks Remy for a great explanation and clarification of how URL encoding works. Given that the application/x-www-form-urlencoded media type is so prevalent, perhaps more work could be done to support that without needing to pre-process input as per your final example? Certainly, when interfacing with other websites, this is by far the most common use case I have encountered.

      'http://www.google.com/search?'+TIdURI.ParamsEncodeComponent('q=big&little')

      I do feel like the recognition of existing %HH octets in TIdURI.ParamsEncode() actually leads to extra complications: how would you call TIdURI.ParamsEncode with a literal '%26' parameter? Would you need to pre-process that and call TIdURI.ParamsEncode('%2526') for it to work in that context?
