A short entry describing a tested function for linkifying tweets. If you follow non-US Twitter users, you’ll sometimes find that they write hashtags “properly”, using diacritical characters. Twitter handles those hashtags properly, so #telefónica will be understood as #telefonica.
But if you’re writing your own application, you’ll need to handle those characters yourself. Twitter sends them encoded as UTF-8, so it’s easy to process them using PCRE… if your installation of PCRE is compiled with full Unicode support.
So, the first step is to check your configuration. This snippet will tell you if PCRE supports Unicode.
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.";
} else {
    echo "PCRE unicode support is turned off.";
}
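The check above gives a single yes/no answer. If you want to know which part is missing, a minimal sketch along these lines may help. It assumes that UTF-8 mode (the u modifier) and Unicode property support (\pL) can be probed separately. The helper name pcreUnicodeSupport is hypothetical, not something PHP provides.

```php
<?php
/**
 * Hypothetical helper: reports which pieces of Unicode support PCRE offers.
 * The @ operator hides the warning PHP emits when a pattern fails to compile.
 */
function pcreUnicodeSupport()
{
    // "ñ" written as raw UTF-8 bytes, so the check does not depend on
    // the encoding this file is saved in.
    $enye = "\xC3\xB1";

    return array(
        // u modifier: pattern and subject are read as UTF-8,
        // so the two bytes count as a single character.
        'utf8' => @preg_match('/^.$/u', $enye) === 1,
        // \pL: Unicode "letter" property, needed for accented hashtags.
        'properties' => @preg_match('/^\pL$/u', $enye) === 1,
    );
}

foreach (pcreUnicodeSupport() as $feature => $available) {
    echo $feature . ': ' . ($available ? 'on' : 'off') . "\n";
}
```

If either entry comes back off, the hashtag pattern further down will not match accented letters reliably.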
If support is turned off, the fix varies from distro to distro. Here you’ll find the way to get Unicode support on CentOS.
And now for a simple function to linkify the tweets.
/**
 * Linkify function with support for UTF-8 characters.
 *
 * @param string $text the raw text of the tweet
 * @return string the same tweet with links, users and hashtags linkified
 */
function procesaTwit($text)
{
    return preg_replace(
        array(
            // URLs
            '@(https?://([-\w\.]+)+(/([\w/_\.\-]*(\?\S+)?(#\S+)?)?)?)@',
            // @user mentions
            '/@(\w+)/',
            // #hashtags; the u modifier makes PCRE read the text as UTF-8,
            // so \pL can match accented letters
            '/#([\d\w\pL]+)/u',
        ),
        array(
            '<a href="$1" target="_blank">$1</a>',
            '<a href="http://twitter.com/$1">@$1</a>',
            '<a href="http://twitter.com/#!/search?q=%23$1">#$1</a>',
        ),
        $text
    );
}
That’s all you need to properly linkify tweets with Unicode characters.
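One thing to keep in mind: when preg_replace receives arrays, it applies the patterns one after another, so the hashtag pattern also scans the links produced by the URL pattern. A URL with a fragment, such as http://example.com/page#top, can end up with its #top turned into a hashtag link inside the href. A possible way around this, sketched below, is to match all three kinds of token in a single pass with preg_replace_callback. The function name linkifyTweet, the simplified URL pattern and the escaping calls are my assumptions, not part of the function above.

```php
<?php
/**
 * Single-pass linkify sketch (hypothetical name).
 * URLs, mentions and hashtags share one alternation, so text that has
 * already been consumed as a URL is never scanned again for # or @.
 */
function linkifyTweet($text)
{
    // Simplified URL pattern (assumption): everything up to the next whitespace.
    $pattern = '~(?P<url>https?://\S+)|@(?P<user>\w+)|#(?P<tag>[\w\pL]+)~u';

    return preg_replace_callback($pattern, function ($m) {
        // Groups that did not take part in the match are empty or missing.
        if (($m['url'] ?? '') !== '') {
            // Escape quotes and ampersands before placing the URL in an attribute.
            $url = htmlspecialchars($m['url'], ENT_QUOTES, 'UTF-8');
            return '<a href="' . $url . '" target="_blank">' . $url . '</a>';
        }
        if (($m['user'] ?? '') !== '') {
            return '<a href="http://twitter.com/' . $m['user'] . '">@' . $m['user'] . '</a>';
        }
        // Hashtag: percent-encode the tag for the query, keep it readable in the text.
        return '<a href="http://twitter.com/#!/search?q=%23' . rawurlencode($m['tag']) . '">#'
            . $m['tag'] . '</a>';
    }, $text);
}

echo linkifyTweet('Hola @juan mira #telefónica http://example.com/page#top'), "\n";
```

The sample tweet should come out with exactly three links, and the #top fragment stays part of the URL instead of becoming a hashtag. The closure and the ?? operator need PHP 7 or later.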