dataflake.org

improving non-ascii DNs support (Resolved)

Request LDAP User Folder -- bug report -- by Wichert Akkerman
Posted on Jan 2, 2006 8:17 am

Entries (Latest first)


  Resolve by Jens Vagelpohl on Aug 22, 2006 10:39 am
  I'm going to close this issue since it's not receiving any further input. Unless the encoding methodology is changed as a whole there should be separate issues per method/case where the current way of encoding/decoding is problematic.
 

  Comment by Jens Vagelpohl on Mar 3, 2006 12:05 pm
  > Another alternative would be for our delegate to encode all unicode
> into latin1 (the default in utils.py, if I am not mistaken), but a
> concern I have with that is that we will still fail should the
> directory contain characters not in that character set (and in a
> truly international directory, there may be *no* character set that
> covers all options).

That setting is in utils.py so you can *change* it ;) IMHO UTF-8 is
always a safe option because "normal" directory servers use that
themselves to encode return values for queries.


> Do you think that patch looks reasonable? I believe similar issues
> will still remain though, and the "correct" answer is probably to
> pass unicode objects around, thereby making it the delegate's
> responsibility to worry about encodings. Would you be open to such
> a patch?

What I really want to get to is a "clean" way of handling strings.
Internally and between the user folder and the delegate all unicode,
between the delegate and the LDAP server a configurable encoding
(defaulting to UTF-8), and out towards the browser it should use
whatever is set in zope.conf itself. This is a lot of work, though.
What's there right now "grew" rather than being planned; it is the
result of trial-and-error testing. The problem is that some methods
get called from internal code as well as the browser or other places
in Zope, so you cannot clearly fence off what method should use what
encoding/unicode on a method by method basis :(

jens
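
The layering described above (text internally, a configurable encoding towards the LDAP server, a separate encoding towards the browser) could be sketched roughly as follows. This is a minimal illustration written for Python 3, where `str` is text and `bytes` is encoded data; the class name `StringBoundary` and its methods are hypothetical and not part of LDAPUserFolder, and `browser_encoding` merely stands in for whatever zope.conf would provide.

```python
# Hypothetical sketch; none of these names exist in LDAPUserFolder.
class StringBoundary:
    """Keep text as str internally and encode only at the edges."""

    def __init__(self, ldap_encoding="utf-8", browser_encoding="utf-8"):
        self.ldap_encoding = ldap_encoding        # delegate <-> LDAP server
        self.browser_encoding = browser_encoding  # stand-in for the zope.conf setting

    def from_ldap(self, raw):
        # Bytes from the server become text exactly once, on the way in.
        return raw.decode(self.ldap_encoding)

    def to_ldap(self, text):
        # Text becomes bytes again only when it leaves for the server.
        return text.encode(self.ldap_encoding)

    def to_browser(self, text):
        # Output towards the browser uses its own, independent encoding.
        return text.encode(self.browser_encoding)


boundary = StringBoundary(browser_encoding="latin-1")
dn = boundary.from_ldap(b"cn=J\xc3\xb4me,dc=example,dc=com")
```

With every conversion confined to these three methods, a method called from both internal code and the browser only ever sees text, which is the "clean" separation the comment is asking for.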

 

  Comment by Mark Hammond on Mar 3, 2006 1:41 am
  Hi Jens,
The Enfold ActiveDirectory delegate returns Unicode objects, and I believe this is what Wichert was using.

At least one problem caused by this goes away with the following patch:

--- utils.py (revision 1305)
+++ utils.py (working copy)
@@ -98,7 +98,9 @@
         return encodeLocal(decodeUTF8(s)[0])[0]

     def to_utf8(s):
-        return encodeUTF8(decodeLocal(s)[0])[0]
+        if type(s)==str:
+            s = decodeLocal(s)[0]
+        return encodeUTF8(s)[0]

 except LookupError:
     raise LookupError, 'Unknown encoding "%s"' % encoding

Another alternative would be for our delegate to encode all unicode into latin1 (the default in utils.py, if I am not mistaken), but a concern I have with that is that we will still fail should the directory contain characters not in that character set (and in a truly international directory, there may be *no* character set that covers all options).

Do you think that patch looks reasonable? I believe similar issues will still remain though, and the "correct" answer is probably to pass unicode objects around, thereby making it the delegate's responsibility to worry about encodings. Would you be open to such a patch?

Cheers,

Mark
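
For readers on a current interpreter, the idea behind the patch (decode only when the input is still an encoded string, then always encode to UTF-8) might look like this in Python 3. This is a sketch, not the code in utils.py: the `local_encoding` parameter is an assumption standing in for the module-level setting, and `bytes`/`str` take the place of Python 2's `str`/`unicode`.

```python
def to_utf8(s, local_encoding="latin-1"):
    """Return UTF-8 bytes for either text or locally encoded bytes."""
    if isinstance(s, bytes):
        # Only encoded input needs decoding; text is passed straight through.
        s = s.decode(local_encoding)
    return s.encode("utf-8")
```

Both `to_utf8("J\xf4me")` and `to_utf8(b"J\xf4me")` then yield the same UTF-8 bytes, so a delegate that returns text no longer trips the conversion.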
 

  Comment by Jens Vagelpohl on Feb 17, 2006 8:13 am
  > Is there any timeline for LDAPUserFolder fully supporting non-ASCII
> characters (German letters)? In Plone, whenever I change the
> permissions on a folder, I always receive the following error message:

The error happens in folder_localrole_form, which is not in the
former CMFLDAP skins now included in the LDAPUserFolder product.
Please file your bug report in a more appropriate place.

 

  Comment by Reinhold Brunner on Feb 17, 2006 8:05 am
  Is there any timeline for LDAPUserFolder fully supporting non-ASCII characters (German letters)? In Plone, whenever I change the permissions on a folder, I always receive the following error message:

Traceback (innermost last):
Module ZPublisher.Publish, line 113, in publish
Module ZPublisher.mapply, line 88, in mapply
Module ZPublisher.Publish, line 40, in call_object
Module Shared.DC.Scripts.Bindings, line 311, in __call__
Module Shared.DC.Scripts.Bindings, line 329, in _bindAndExec
Module Shared.DC.Scripts.Bindings, line 348, in _bindAndExec
Module Products.CMFCore.FSPageTemplate, line 195, in _exec
Module Products.CMFCore.FSPageTemplate, line 134, in pt_render
Module Products.PageTemplates.PageTemplate, line 104, in pt_render
- <FSPageTemplate at /realinvest/folder_localrole_form used for /realinvest/home>
Module TAL.TALInterpreter, line 202, in __call__
Module TAL.TALInterpreter, line 206, in __call__
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 711, in do_useMacro
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 426, in do_optTag_tal
Module TAL.TALInterpreter, line 411, in do_optTag
Module TAL.TALInterpreter, line 406, in no_tag
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 742, in do_defineSlot
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 426, in do_optTag_tal
Module TAL.TALInterpreter, line 411, in do_optTag
Module TAL.TALInterpreter, line 406, in no_tag
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 690, in do_defineMacro
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 734, in do_defineSlot
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 653, in do_loop_tal
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 426, in do_optTag_tal
Module TAL.TALInterpreter, line 411, in do_optTag
Module TAL.TALInterpreter, line 406, in no_tag
Module TAL.TALInterpreter, line 250, in interpret
Module TAL.TALInterpreter, line 676, in do_condition
Module Products.PageTemplates.TALES, line 221, in evaluate
Module Products.PageTemplates.ZRPythonExpr, line 47, in __call__
- __traceback_info__: entry['name']!=username
Module Python expression "entry['name']!=username", line 1, in <expression>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
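
The failing expression, `entry['name']!=username`, together with the byte 0xc3, appears consistent with a UTF-8 encoded string being compared against a unicode object. A hedged sketch of normalizing both sides to text before comparing, written for Python 3; the helper names `as_text` and `is_other_user` are made up for illustration and are not part of Plone or LDAPUserFolder:

```python
def as_text(value, encoding="utf-8"):
    """Decode bytes to text; leave text untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def is_other_user(entry_name, username, encoding="utf-8"):
    # Compare only after both sides are known to be text.
    return as_text(entry_name, encoding) != as_text(username, encoding)
```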


 

  Comment by Jens Vagelpohl on Jan 6, 2006 10:32 am
  The fact that findUser returns unicode: was that behavior there
before, or is it the result of work you did? I can't remember it ever
returning unicode.

 

  Comment by Wichert Akkerman on Jan 6, 2006 10:26 am
  I agree it is evil, but until DocumentTemplates is fixed I doubt there is a good alternative. You could add a special findUser method that returns its data in a way which DTML will handle properly but that is just as bad in my opinion.

The unicode in this case is the DN returned by findUser when called from users.dtml
 

  Comment by Jens Vagelpohl on Jan 6, 2006 10:23 am
  I hate to say it, but I absolutely abhor monkey patching like that. I
simply will not release any product under my name that monkey patches
other products or the Zope internals...

By the way, where is the unicode coming from that you're trying to
render? Is there anything in the LDAPUserFolder product that emits
unicode instead of encoded strings?


 

  Comment by Wichert Akkerman on Jan 6, 2006 10:19 am
  Here is the next essential bit of magic: DTML does not handle unicode strings correctly, which bites us when we try to display unicode DNs. This needs to be fixed in the DocumentTemplates module in Zope, but until then the monkeypatch below gets things working.


import urllib

def dtml_url_quote(v, name='(Unknown name)', md={}):
    # Encode unicode as UTF-8 before quoting; stringify anything else.
    if type(v)==unicode:
        v=v.encode("utf-8")
    else:
        v=str(v)
    return urllib.quote(v)

def dtml_url_quote_plus(v, name='(Unknown name)', md={}):
    if type(v)==unicode:
        v=v.encode("utf-8")
    else:
        v=str(v)
    return urllib.quote_plus(v)

# Replace the stock DTML quoting helpers with the unicode-aware versions.
from DocumentTemplate import DT_Var
DT_Var.url_quote=dtml_url_quote
DT_Var.url_quote_plus=dtml_url_quote_plus
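
For comparison, the same quoting behaviour without any patching might look like this on Python 3, where `urllib.parse.quote` encodes text as UTF-8 by default. The function names are hypothetical and this is only a sketch of the quoting step, not a replacement for the DTML machinery:

```python
from urllib.parse import quote, quote_plus


def url_quote_text(value):
    """Percent-quote a value, encoding text as UTF-8 first."""
    if not isinstance(value, (str, bytes)):
        value = str(value)  # numbers and other objects are stringified
    return quote(value)


def url_quote_plus_text(value):
    """Like url_quote_text, but spaces become '+'."""
    if not isinstance(value, (str, bytes)):
        value = str(value)
    return quote_plus(value)
```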
 

  Comment by Jens Vagelpohl on Jan 4, 2006 12:16 pm
  > I have been doing some more work on this. I started off with the
> idea of having separate functions for recoding strings coming in
> via LDAP and coming in through other means. However it appears that
> the ldap module already returns everything in unicode, so all the
> calls dealing with ldap-destined or originated data can be removed.

Hold on, are you confusing unicode and UTF8-encoded strings here? I
have never seen python-ldap return unicode objects...


> All problems I see happening appear to be due to calling to_utf8 on
> data that already is a unicode instance. My current thinking is
> that the best approach is to convert everything to unicode that
> isn't unicode already. This will lead to too many calls of the
> unicode-conversion method but it is the only way to make sure that
> all data is in the correct format, since we do not always know
> where it is coming from.

I'm assuming here you mean real unicode objects and not UTF8-encoded
strings?


> However, as demonstrated by the objectGUID attribute, there is the
> small problem of non-string attribute values. Currently the code
> seems to be hardcoded to consider everything but objectGUID a
> string which should be unicode. For now we can go with that
> assumption, but eventually LDAP schema introspection can be a nice
> addition. Or perhaps in the meantime adding a value type to the
> LDAP schema.

IMHO in order to keep the scope sane I would only special-case
objectGUID. Everything else from LDAP is assumed to be strings - if
it is not it will be ignored. It is not in the current scope to deal
with non-string values like images or other binary data with the
exception of objectGUID.


 

  Comment by Wichert Akkerman on Jan 4, 2006 12:04 pm
  I have been doing some more work on this. I started off with the idea of having separate functions for recoding strings coming in via LDAP and coming in through other means. However it appears that the ldap module already returns everything in unicode, so all the calls dealing with ldap-destined or originated data can be removed.

All problems I see happening appear to be due to calling to_utf8 on data that already is a unicode instance. My current thinking is that the best approach is to convert everything to unicode that isn't unicode already. This will lead to too many calls of the unicode-conversion method but it is the only way to make sure that all data is in the correct format, since we do not always know where it is coming from.

However, as demonstrated by the objectGUID attribute, there is the small problem of non-string attribute values. Currently the code seems to be hardcoded to consider everything but objectGUID a string which should be unicode. For now we can go with that assumption, but eventually LDAP schema introspection can be a nice addition. Or perhaps in the meantime adding a value type to the LDAP schema.
 

  Comment by Jens Vagelpohl on Jan 4, 2006 11:52 am
  This is an obvious bug and I just applied your patch, thanks!

 

  Comment by Wichert Akkerman on Jan 4, 2006 11:46 am
  I am making some progress here. The tiny patch below is needed and obviously correct: the objectGUID is a binary string which can not be treated as text. There are checks for objectGUID in other places but this one was missing.

diff -wur -x .svn /home/wichert//svn/LDAPUserFolder/trunk/LDAPUser.py ./LDAPUser.py
--- /home/wichert//svn/LDAPUserFolder/trunk/LDAPUser.py 2006-01-02 13:38:54.000000000 +0100
+++ ./LDAPUser.py 2006-01-04 16:17:59.859061200 +0100
@@ -61,7 +61,7 @@
             else:
                 prop = user_attrs.get(key, [None])[0]

-            if isinstance(prop, str):
+            if isinstance(prop, str) and key!='objectGUID':
                 prop = _verifyUnicode(prop)

             self._properties[key] = prop
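
The same rule (treat every attribute value as text except known binary ones) can be sketched as a standalone function. This is an illustration in Python 3, not the LDAPUser.py code: `BINARY_ATTRS` and `decode_attrs` are hypothetical names, and the set holds only objectGUID because that is the one attribute the thread special-cases.

```python
# Assumed set of binary attributes; the thread special-cases only objectGUID.
BINARY_ATTRS = frozenset(["objectGUID"])


def decode_attrs(user_attrs, encoding="utf-8"):
    """Take the first value of each attribute, decoding text attributes."""
    props = {}
    for key, values in user_attrs.items():
        prop = values[0] if values else None
        if isinstance(prop, bytes) and key not in BINARY_ATTRS:
            prop = prop.decode(encoding)
        props[key] = prop  # binary values are stored untouched
    return props


props = decode_attrs({
    "cn": [b"J\xc3\xb4me"],
    "objectGUID": [b"\x9f\x01\xff"],
    "mail": [],
})
```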
 

  Comment by Jens Vagelpohl on Jan 4, 2006 6:43 am
  > Luckily RFC 2253 says that implementations must accept RFC 1779
> syntax, which suggests that the safe road is to never do any
> recoding of DNs we receive from the server but reuse them as-is.

That's easier said than done. I believe there are methods in the user
folder and/or the delegate (don't have the time to investigate but I
know there are) which can get DNs passed in from either the server or
the outside world... There is no clear line, unfortunately. Obviously
DNs passed in from the outside world must be processed.

 

  Comment by Wichert Akkerman on Jan 4, 2006 6:20 am
  I have been looking at DN encoding in LDAP. Section 2.4 from RFC 2253 is relevant: it says that a DN should be encoded using either #-escaping for non-string values or using utf-8 with character escaping where necessary. In both cases the result will be a valid utf-8 string.

However, it seems Microsoft's Active Directory implementation does not do this but returns DNs in ANSI 1252 (as allowed in the older RFC 1779).

Luckily RFC 2253 says that implementations must accept RFC 1779 syntax, which suggests that the safe road is to never do any recoding of DNs we receive from the server but reuse them as-is.
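
Reusing the DN bytes as-is covers talking to the server; for display, though, something still has to guess the encoding. One possible heuristic, sketched in Python 3 under the assumption that a server sends either UTF-8 or code page 1252: try UTF-8 first and fall back only when that fails. The function name `decode_dn` is hypothetical, and the fallback is a guess rather than anything the RFCs prescribe.

```python
def decode_dn(raw, fallback="cp1252"):
    """Decode DN bytes for display: UTF-8 first, then a fallback code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy code page. A few byte
        # values are undefined in cp1252 and would still raise here.
        return raw.decode(fallback)
```

The original bytes should still be the ones sent back to the server; the decoded text is only for showing the DN to a person.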
 

  Comment by Jens Vagelpohl on Jan 2, 2006 9:25 am
  > I'm wondering why you need to encode the dn at all in
> _lookupuserbyattr. In other places the dn is used untouched, and
> removing the utf8 conversion there seems to work fine.

That's basically been a trial-and-error process. When doing that code
I had a lot of cases where I ended up double-encoding strings. This
whole unicode business is just a complete mess.

My other problem when it comes to AD issues is that I have no way to
test anything against AD. I have no AD instance, and no windows box
at all here. As far as AD goes I have to rely on patches from people
who can devise and test bug fixes themselves. The
ActiveDirectoryPlugin code for example is from Chris McDonough who
had access to AD when he developed it. It seems that yours is one of
those specific AD cases. If you can come up with a patch yourself
I'll be happy to look at it.

 

  Comment by Wichert Akkerman on Jan 2, 2006 8:42 am
  The traceback happens because the DN in my example happens to be CP1252 encoded. The traceback seems to be related to the encoding method: the input is an invalid unicode string (it really is cp1252 but a python unicode instance is used). How the ascii codec comes into play here I don't know.

I'm wondering why you need to encode the dn at all in _lookupuserbyattr. In other places the dn is used untouched, and removing the utf8 conversion there seems to work fine.
 

  Accept by Jens Vagelpohl on Jan 2, 2006 8:28 am
  Hm... thinking about that default encoding, wouldn't it make sense to use the one in zope.conf instead of defining it for the user folder? What do you think? I'm leery of adding more knobs to the already overcomplicated configuration screens...

About that UTF-8 conversion: the DN in the traceback looks like ASCII to me. Or is it just a generic example to show that setting the encoding to UTF-8 is broken? Let me know what exactly is being proven with the traceback.
 

  Initial Request by Wichert Akkerman on Jan 2, 2006 8:17 am
  Active Directory seems very fond of creating non-ascii DNs, resulting in tracebacks such as the one below.

While looking at this I noticed two things that can be improved:

- the character set is set in utils.py, but you can not (easily) change it except by modifying LDAPUserFolder. It would be nice if you could set the encoding per LDAPUserFolder as a property

- the utf8-to-utf8 case uses str() which will not work since str turns a unicode string into ascii; it should probably use a lambda identity function (lambda x: x)

site.acl_users.adsi.acl_users.getUserById("test@ads.intranet.amaze.nl")
Traceback (most recent call last):
  File "<input>", line 1, in ?
  File "C:\Program Files\Enfold Server\Products\LDAPUserFolder\LDAPUserFolder.py", line 735, in getUserById
    user = self.getUserByAttr(self._uid_attr, id, cache=1)
  File "C:\Program Files\Enfold Server\Products\LDAPUserFolder\LDAPUserFolder.py", line 657, in getUserByAttr
    user_roles, user_dn, user_attrs, ldap_groups = self._lookupuserbyattr(
  File "C:\Program Files\Enfold Server\Products\LDAPUserFolder\LDAPUserFolder.py", line 250, in _lookupuserbyattr
    utf8_dn = to_utf8(dn)
  File "C:\PROGRA~1\ENFOLD~3\Client_5/../Products\LDAPUserFolder\utils.py", line 101, in to_utf8
    return encodeUTF8(decodeLocal(s)[0])[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 5:
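
Both suggestions in this request (an encoding that can be chosen per folder, and an identity function for the UTF-8-to-UTF-8 case) could be combined along these lines. This is a sketch in Python 3 operating on `bytes`; `make_converters` is a hypothetical factory, not a function in utils.py, and the `latin-1` default only mirrors what the thread says the module uses.

```python
import codecs


def make_converters(encoding="latin-1"):
    """Return (to_utf8, from_utf8) for the given local encoding."""
    try:
        info = codecs.lookup(encoding)
    except LookupError:
        raise LookupError('Unknown encoding "%s"' % encoding)

    if info.name == "utf-8":
        # Local encoding is already UTF-8: hand the bytes back unchanged.
        identity = lambda b: b
        return identity, identity

    def to_utf8(b):
        return b.decode(encoding).encode("utf-8")

    def from_utf8(b):
        return b.decode("utf-8").encode(encoding)

    return to_utf8, from_utf8


to_utf8, from_utf8 = make_converters("latin-1")
same, _ = make_converters("UTF8")
```

Because the encoding is an argument rather than a module constant, each user folder instance could build its own pair of converters from a property.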