Hiding evil code in invisible unicode

Sat Apr 19 12:38:00 UTC 2025

> On 04/19/2025 1:14 AM PDT Hal Murray via devel <devel at ntpsec.org> wrote:
>  
> We allow/require UTF-8 rather than simple ASCII.  I know we need that to 
> get the character for micro, as in microseconds.  Do we need it for 
> anything else?

We should be able to get away with something closer to ASCII if we
encode micro and the like as Unicode escape sequences or code points,
such as "\u00b5" or "\xb5"; we might want Unicode for contributor
names later.
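
As a rough illustration of the escape idea, here is a minimal sketch
in Python. The language is an assumption (nothing above fixes one),
and the names are made up for the example. It shows that both escapes
name U+00B5 MICRO SIGN while the source file itself stays pure ASCII.

```python
import unicodedata

# Two ASCII-only spellings of U+00B5 MICRO SIGN in Python source.
# (Hypothetical names, for illustration only.)
MICRO_U = "\u00b5"   # \u takes exactly four hex digits
MICRO_X = "\xb5"     # \x takes exactly two hex digits

# Both escapes produce the same one-character string.
assert MICRO_U == MICRO_X
assert unicodedata.name(MICRO_U) == "MICRO SIGN"

# The escape only keeps the *source* ASCII; output is still UTF-8.
label = f"10 {MICRO_U}s"
print(label.encode("utf-8"))   # b'10 \xc2\xb5s'
```

Other languages treat these escapes differently. In C, for example,
"\xb5" in an ordinary string literal is a single byte 0xB5, which is
not valid UTF-8 on its own, so the exact spelling would need checking
per language.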

> I saw a note recently about AI being susceptible to hiding evil code in invisible unicode.
> 
> New Vulnerability in GitHub Copilot and Cursor: How Hackers Can Weaponize 
> Code Agents
>   https://www.pillar.security/blog/new-vulnerability-in-github-copilot-and-cursor-how-hackers-can-weaponize-code-agents
> 
> -----
> 
> Is there a package we should be using that checks code for invisible unicode?

I feel compelled to mention (dang NIH*) filescan[1], which is
something I wrote for gpsd to detect higher codepoints, tabs, and
trailing whitespace.
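
For anyone who wants a feel for that kind of check, here is a minimal
sketch in Python. It is not filescan and makes no claim about how
filescan works; the function names and message strings are
hypothetical. It flags tabs, trailing whitespace, and non-ASCII
codepoints, and separately calls out Unicode "format" characters
(general category Cf, which covers zero-width and bidi-control
characters), on the assumption that those are the invisible ones
worth the loudest warning.

```python
import unicodedata


def scan_line(line):
    """Return a list of (column, description) findings for one line."""
    findings = []
    text = line.rstrip("\r\n")
    stripped = text.rstrip(" \t")
    if stripped != text:
        # Column where the trailing run of spaces/tabs begins.
        findings.append((len(stripped) + 1, "trailing whitespace"))
    for col, ch in enumerate(text, start=1):
        if ch == "\t":
            findings.append((col, "tab"))
        elif unicodedata.category(ch) == "Cf":
            # Format characters render as nothing in most editors.
            findings.append((col, f"invisible format character U+{ord(ch):04X}"))
        elif ord(ch) > 0x7F:
            findings.append((col, f"non-ASCII character U+{ord(ch):04X}"))
    return findings


def scan_file(path):
    """Print findings for one file and return how many there were."""
    count = 0
    # Undecodable bytes become U+FFFD, which is then reported as non-ASCII.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for lineno, line in enumerate(handle, start=1):
            for col, what in scan_line(line):
                print(f"{path}:{lineno}:{col}: {what}")
                count += 1
    return count
```

Calling scan_file() on each path given on the command line and
exiting non-zero when any count is non-zero would make this usable as
a pre-commit or CI check. A real tool would also want an allow-list
for the few non-ASCII characters we intend to keep, such as micro.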

I have not looked at that blog post yet, but a more focussed tool
written by someone else would generally be more appropriate.

* Not Invented Here

[1] https://gitlab.com/gpsd/gpsd/-/blob/master/devtools/filescan