Cool IRI for a permalink

The World Wide Web has mainly been built on US-ASCII, even for URLs. But from now on, the WWW will shift from a US-ASCII-only world to an internationalized one with the help of Unicode, UTF-8, IRI, and so on.

If you are European, Asian, or from anywhere else, you will write entries in your own language, so their titles will be in your language too. However, default MT removes or deforms non-US-ASCII characters in the title when it makes the entry basename for the permalink URL (which is completely correct from the URI point of view). As a result, such users could not enjoy the full features of MT that create Cool URIs. Only English users could.

This plugin makes your permalinks Cool IRIs even if your preferred language doesn't use US-ASCII. If you are unfamiliar with the notions of IRI and Cool URI, please read What is a Cool IRI? (Of course, you don't have to fully understand the underlying mechanism in order to use this plugin. However, the notion of a Cool IRI is very easy and natural to understand. If you have no time to read it, just look at some of the pictures in it.)


* You may feel uneasy about adopting IRIs on your current blog because you are accustomed to US-ASCII-only addresses (URIs). However, I'm sure internationalized addresses will become more and more common and will be supported by most browsers and web servers.
So I recommend that you test Cool IRI by creating a new MT instance under an mt2/ folder with Berkeley DB. That way your current blogs and DB are unaffected by this plugin.

Requirements

  • Movable Type ver. 3.2
  • Perl 5.8 (I'm not sure exactly which version of Perl started to support Unicode)
  • Perl's Encode module
  • PHP's mbstring extension, if you use Dynamic Publishing.
  • Permission to customize .htaccess in user directories, if you use the Apache web server.

Installation

Download this plugin and untar it under the (mt home) folder, e.g. $ tar xvfz coolIRI-3.2.04.tar.gz. The files will then be located as below:

  • (mt home)/dot_htaccess
  • (mt home)/url_convert.cgi
  • (mt home)/plugins/BigPAPI.pl
  • (mt home)/plugins/alogblog/coolIRI.pl
  • (mt home)/php/plugins/init.alogblog-coolIRI.php
  • (mt home)/default_templates/dynamic_site_bootstrapper.tmpl

Archive Mapping for Cool IRI



  • yyyy, mm, dd, entry_basename, index.html
    You can use the default MT tags and format specifiers (%y, %m, %d, %b, %i) to build mappings such as "yyyy/mm/entry_basename/index.html".
  • category/sub_category
    The original MT tag for this mapping is <MTSubCategoryPath> (= %c), so the new Cool IRI tag is named <MTSubCategoryCoolPath>. A format specifier like %c for this new tag isn't possible using only the plugin API, so a simpler and more memorable tag, <MTCatSubcat>, is also provided.

    For example,
    if you want "Category Archive" to be "category/sub_category/index.html", then you can enter '<MTCatSubcat>/%i' in the "Custom..." field.
    If you want "Individual Entry Archive" to be "category/sub-category/entry_basename/index.html", then you can use '<MTCatSubcat separator="-">/%b/%i'.

  • primary_category
    <MTArchiveCoolCategory> (= <MTPriCat>)

    For example,
    if you want "Individual Entry Archive" to be "primary-category/entry_basename/index.html", then you can use '<MTPriCat dirify="-">/%b/%i'.


You may be confused that MTCatSubcat uses the separator attribute while MTPriCat uses the dirify attribute when you need "-" (minus or dash sign) instead of space characters. (The default is "_", the underline sign.) I followed MT's internal behavior here. I guess it is due to some PHP routines.
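
To make the mapping examples above concrete, here is a minimal sketch in Python (not the plugin's own Perl/PHP) of how a mapping like '<MTCatSubcat separator="-">/%b/%i' could expand into a path. The function names are hypothetical, and it assumes the separator simply replaces the spaces inside each category label; the real tag may normalize labels further.

```python
def cat_subcat(labels, separator="_"):
    """Join category labels into a path, replacing spaces inside each label.

    Assumption: only space characters are replaced; nothing else is changed.
    """
    return "/".join(label.replace(" ", separator) for label in labels)


def entry_path(labels, basename, separator="_", index="index.html"):
    """Rough equivalent of '<MTCatSubcat separator="...">/%b/%i'."""
    return "/".join([cat_subcat(labels, separator), basename, index])


if __name__ == "__main__":
    # Hypothetical category tree and basename, for illustration only.
    print(entry_path(["Movie", "Sci Fi"], "my_best_thriller_in_2005", separator="-"))
```

With the default separator the same call would keep "_" between the words of a label, which matches the default described above.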

Adding IRI feature to Web Server

Once we set the archive mapping and rebuild the archives, we can fully use Cool IRI.
However, lastly we should consider the cases where the encoding of the incoming address is not UTF-8 and the web server does not implement the IRI feature. Therefore, I provide a Perl CGI for static publishing and PHP modifications for dynamic publishing.

Static Publishing
  1. Move the two files (dot_htaccess and url_convert.cgi) to the blog site root. For example, $ mv dot_htaccess /home/users/joon/www/blog/
  2. Rename dot_htaccess to .htaccess, and give execute permission to url_convert.cgi. For example, $ mv dot_htaccess .htaccess; chmod +x url_convert.cgi
  3. Open the .htaccess file and you will see one line like "ErrorDocument 404 /blog/url_convert.cgi". The path, /blog/url_convert.cgi, is the path from the web root to url_convert.cgi. Modify it for your environment. Of course, if you already have your own .htaccess, you only have to add this line to it.
  4. Open the url_convert.cgi file and modify the 3 variables (there are some comments next to them) for your environment.
  5. Lastly, in url_convert.cgi, you will see the line "use Encode::Guess qw(utf8 euc-kr euc-jp shift_jis iso-8859-1);". The last part of this line lists the encodings you want to guess from. If utf8 and iso-8859-1 are the encodings generally used for your language, erase the others. My IRI language is Korean, so in my case the last part will be "qw(utf8 euc-kr)". For Japanese, it will be qw(utf8 euc-jp shift_jis). For Chinese, qw(utf8 gb2312). I don't know the frequently used encodings for every language. I think you will know them better.
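
The idea behind the 404 handler in the steps above can be sketched as follows. This is an illustration in Python, not the actual url_convert.cgi, and it uses a simpler rule than Encode::Guess: it takes the first candidate encoding that decodes the request path without error. The function name and the default encoding list are assumptions for the example.

```python
from urllib.parse import quote, unquote_to_bytes


def to_utf8_path(request_path, encodings=("utf-8", "euc-kr")):
    """Return the UTF-8 percent-encoded form of a request path, or None.

    Candidates are tried in order, so put strict encodings first and
    catch-all ones such as iso-8859-1 (which accepts any byte) last.
    """
    raw = unquote_to_bytes(request_path)  # undo %HH escaping, keep raw bytes
    for name in encodings:
        try:
            text = raw.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return quote(text, safe="/")  # re-encode as UTF-8 %HH
    return None


if __name__ == "__main__":
    # A path whose Korean segment was sent as EUC-KR instead of UTF-8.
    legacy = quote("/blog/한글/".encode("euc-kr"), safe="/")
    print(legacy, "->", to_utf8_path(legacy))
```

A handler built on this would redirect the browser to the returned path when it differs from the requested one, and fall through to the normal 404 page when the result is None.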
Dynamic Publishing
  1. Go to MT's Index Templates menu page.
  2. Select the "Dynamic Site Bootstrapper" template by checking it, and run the "Refresh Template" action.
  3. Open the refreshed template, and you will see a line like "mb_detect_encoding($path,'UTF-8, EUC-KR, EUC-JP, SHIFT_JIS, GB2312, ISO-8859-1');"
  4. Refer to step 5 of the static publishing instructions above and modify the list to your language's preferred encodings.

Notes

  • IMPORTANT!!! - In MT's "Setting" -> "New Entry Defaults" menu page, you can set "Basename Length". The default value is 30 characters. MT internally assigns 250 bytes to the basename. UTF-8, however, uses 2 or 3 bytes on average per non-US-ASCII character. If you set the length to 250, the required size might be 250 x (2~3) bytes, which is over the limit. For safety, I recommend setting it to 80 (about 250/3). A maximum of 80 characters is more than enough for a basename.
  • MT's default dirify function removes all punctuation and converts spaces to "_" (underline). What is the role of "_"? It works for readability, and it also separates keywords.
    However, in some languages readability is achieved with characters other than the space. For example, take the Japanese "・" (let's just call it the middle point).

    This middle point is generally (I'm not sure) used as in "イーク・アセス". In this case, if we follow MT's punctuation policy, the middle point is removed and the result is "イークアセス". I think this is bad for readability and for search-engine keywords. I think "イーク_アセス" is better in both meaning and readability.

    So I treat 4 characters (1、2 3,4・) as a space.
    For example, the phrase "表、明 自,スポーツ,本ク・ア" will be IRI-dirified as "表_明_自_スポーツ本ク_ア".

    If you want other characters in your language to work as a space, let me know.


  • This plugin makes a permalink a Cool IRI, so unCool factors like the entry ID never appear in permalinks. For example, a default permalink of a category/date-based archive has a form like "http://www.example.com/blog/movie/thriller/#000234". If you use this plugin, it (<MTEntryPermalink archive_type="Category">) will be like "http://www.example.com/blog/movie/thriller/#my_best_thriller_in_2005". So in archive templates other than the individual entry archive, you have to replace '<a id="a<$MTEntryID pad="1"$>"></a>' with '<a id="<$MTEntryBasename$>"></a>'.
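
The basename-length and space-like-character notes above can be sketched together. This is only an illustration in Python, not the plugin's Perl code. It assumes that the four space-like characters are the ideographic comma (U+3001), the ideographic space (U+3000), the fullwidth comma (U+FF0C) and the middle point (U+30FB), and the function names are made up for the example.

```python
import re

# Assumption: these are the four characters treated like a space.
SPACE_LIKE = "\u3001\u3000\uff0c\u30fb"


def iri_dirify(title, sep="_"):
    """Dirify a title while keeping non-ASCII letters."""
    # Turn the space-like characters into ordinary spaces first.
    text = re.sub("[" + SPACE_LIKE + "]", " ", title)
    # Drop remaining punctuation; \w keeps letters and digits of any script.
    text = re.sub(r"[^\w\s-]", "", text)
    # Collapse each run of whitespace into a single separator.
    return re.sub(r"\s+", sep, text.strip())


def utf8_len(basename):
    """Number of bytes the basename needs when stored as UTF-8."""
    return len(basename.encode("utf-8"))


def fits_basename_column(basename, limit_bytes=250):
    """True if the basename fits into the 250 bytes MT assigns to it."""
    return utf8_len(basename) <= limit_bytes


if __name__ == "__main__":
    basename = iri_dirify("イーク\u30fbアセス")
    print(basename, utf8_len(basename), fits_basename_column(basename))
```

A basename of 80 characters at 3 bytes each needs 240 bytes and fits, while 250 such characters would need 750 bytes, which is the reason for the recommended setting of 80.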

Credits

Thanks, Kevin Shay. With his BigPAPI, it was easy to make the Basename field on the Edit Entry page accept non-US-ASCII characters.

License

Released under the Creative Commons License.

Version History

  • 3.2.04: redirect subroutine: use HTML meta only when the URL is that of an individual entry.
  • 3.2.03: IRI for every possible language.
  • 1.0 : Korean language specific plugin.

Comments

I was looking forward to this plugin.
I want to try it immediately.

On my website 8think.com, hosted at dreamhost.com, I removed the following files and it works fine for me:

# (mt home)/dot_htaccess
# (mt home)/url_convert.cgi
# (mt home)/default_templates/dynamic_site_bootstrapper.tmpl

Maybe that is because the server already implements the IRI feature.

This plugin is really cool! It is useful for Asians. Thanks.

Hi

I installed Cool-IRI yesterday. So cool, thanks for your great script. My blog is in Chinese. It shows correctly in IE (Chinese version), but in Firefox (English version) the Chinese part shows up as all "%", unlike in IE (Chinese version).

In Firefox, network.standard-url.escape-utf8 can be set to "false". Then %HH escaping will not be done. This option only affects how the URL is displayed. People can access the pages regardless of its value.

Hi Lee,

Thanks for your kind reply.

Also I've got your comment and email. Thanks a lot.

About Cool_IRI, I don't quite understand what you mentioned in your reply. You said "In Firefox". Does that mean the Firefox installed on my computer? If so, where is "network.standard-url.escape-utf8"? I tried to find it in the Firefox documents on my computer, but I couldn't. This is the first question.

The second: if I set it as you said in my Firefox, then only I can read the URL correctly, not other people who use a non-Chinese version of Firefox.

The third: I hoped this setting was in the plugin. I searched the whole plugin many times, but I still can't find it. If this setting were in the plugin, then all visitors using non-Chinese versions of Firefox could read it correctly. That would be really COOL!

The fourth: it displays correctly in my Chinese version of IE, but how will a non-Chinese one display it? I only have the Chinese version. Would you check it with yours?

The last :-), and I think the most important: if non-Chinese versions of any kind of browser can't display the Chinese and show "%" instead, then when somebody links to my post, the URL will have a lot of "%" in it, which is not my original URL. What about search engines? And what about passing PR? I'm not so sure about this, so I'm hoping you can lend a hand.

I think I said a lot for it. Sorry for disturbing you so much.

Thanks again!

Yang

This comment is the reply to Yang.
========================================================
Hi
You asked many things, and I think every question is reasonable. I'll tell you what I know and think.


1. In FF's address bar, enter "about:config" and all of FF's config entries will be listed. The list is very long.
So enter "utf" in the "filter" area, and three or four config fields are shown. There you can set the option to "true" or "false".

2. Yes and no. I'm using the Korean FF, but I can see your Chinese IRI. It doesn't depend on the FF language version. It depends on the fonts installed on the PC.

Of course, even some Chinese users who have Chinese fonts on their PC may see the %HH-encoded address. I'm not sure whether the default of network.standard-url.escape-utf8 in FF is true or false. If it is "true", that FF user only sees the %HH address.

As you know, currently nearly all netizens are unfamiliar with IRI. They are accustomed to the old(?) URL, which doesn't allow non-ASCII. Some people say a non-ASCII address is INVALID, so we should not use it in a URL (escaping to %HH comes from this reasoning). But this limitation is too strict for non-US users.

For now, IRI is in a transitional state. IE 6 doesn't support IRI well (IE 7 will). Modern browsers like FF/Mozilla/Opera support IRI, but not yet perfectly or smoothly.


Let me introduce one example.

Do you know the internationalized domain name (IDN), which is included in the wider meaning of IRI?
職業.com is one example. (I hope that domain name is displayed properly in your environment.)

When someone enters 職業.com in his browser, of course he will go to that site.
When I did it, my FF browser displayed it as http://xn--q6v940c.com/. Why?

FF provides the network.IDN_show_punycode true/false config option so that users can choose whether the IDN (職業.com) or the Punycode (xn--q6v940c.com/) is shown in the browser's address bar.
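
If you want to see the two forms of an IDN yourself, here is a small sketch in Python. It relies on Python's built-in "idna" codec, which follows the older IDNA 2003 rules, so take it as an illustration of the idea rather than of what a particular browser does. The helper names are made up for the example.

```python
def to_punycode(host):
    """Convert a Unicode host name to its ASCII (xn--) form."""
    return host.encode("idna").decode("ascii")


def from_punycode(host):
    """Convert an ASCII (xn--) host name back to its Unicode form."""
    return host.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_host = to_punycode("職業.com")
    print(ascii_host)
    print(from_punycode(ascii_host))
```

Both forms name the same domain. The browser option only decides which of them is displayed.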

This IDN problem(?) seems similar to our IRI display problem.

Again 2.
Yes. Users who turned off that IRI or IDN option only see the %HH URL or the xn-***** Punycode domain name.

That is the right of the browser's user, and I think it is not unreasonable. End users should have full power to control their browsers.

The real problem(?) is that currently many people are not familiar with IRI/IDN, and as a result they don't realize the FACT that they have such options to turn on or off.

But as time passes and more people (mainly Eastern users) use IRI/IDN, more people will know their right to display their own language's characters in browsers.

3.
As explained above, the appearance of an IRI is only a matter of how the address is temporarily displayed. Yes, that problem is not essential. It is only about display.

On the server side, we have no way to control the end user's actions. For example, we can fix some font's size to an absolute size. Previously we might have thought: yes, all users will see my page with my fixed font size. IE 6 keeps that. But all modern browsers don't keep that size. They let end users control the relative font size even though the page author fixed the font size.

I think the appearance in the browser's address bar is like that.

4.
You are concerned that if non-Chinese users don't see the right Chinese address, can they point to your page? Can they access your page?

I say YES. Of course non-Chinese users may not see the right Chinese address, but this doesn't mean they can't access that page.

For example, take the IRI alogblog.com/中國/index.html.
You may think an IRI address is used by typing 中國 in the browser's address bar.
Of course you can do that.

BUT that is not the normal computing situation. Few users type URLs by hand.
For example, American users who have no Chinese fonts CAN'T type or see 中國 directly.

However, if they follow a link to that page (中國/index.html) from another page,
they can still access that page.
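
The reason the link still works is that the %HH form is just the UTF-8 bytes of the same address written in ASCII. A minimal sketch in Python, with helper names made up for the example:

```python
from urllib.parse import quote, unquote


def iri_to_uri_path(path):
    """Percent-encode the UTF-8 bytes of every non-ASCII character."""
    return quote(path, safe="/")


def uri_to_iri_path(path):
    """Decode %HH escapes back into readable characters."""
    return unquote(path)


if __name__ == "__main__":
    iri = "/中國/index.html"
    uri = iri_to_uri_path(iri)
    print(uri)                   # pure ASCII, usable from any page or browser
    print(uri_to_iri_path(uri))  # the readable form again
```

So a link written in %HH form and a link written in Chinese characters lead to the same page. Only the display differs.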


As a conclusion, I would ask: why do we use IRI?
In my case, why do I use Korean URLs?
In your case, why do you use Chinese URLs?
Why does 職業.com use that domain instead of occupation.com?

I think users who adopt IRI/IDN treat their own people (who can understand/use/see their language) as more important than potential worldwide users.

If a page whose URL is example.com/best_page.html contains only Korean, then its ASCII URL alone doesn't guarantee visits from non-Korean users.

So if a page of yours has global content, write the title in English.
If a page has only Chinese characters and is for users who understand Chinese, write the title in Chinese.

I also follow that rule.

5.
If the person linking to your Chinese IRI is one who can understand/use/see it (in general, he may be Chinese), then he doesn't need to use the %HH address, if he knows that it is only for URI-transitional display purposes.

And if, for example, the linker is an American who cannot use or see Chinese, he has to use %HH anyway in his ISO-8859-1 encoded page.

But as I said in 4, most of the time your permalink will be linked in its Chinese form.
If you worry about others linking in %HH form because their browsers show your IRI with %HH, I recommend that you embed your IRI permalink in the individual entry page. If someone wants to link to it, they will copy that Chinese address from the page, not from the browser's address bar.

More conveniently, you can provide an address copy button for linkers.
http://alogblog.com/blog/archives/2004/06/Mozillar에서도_가능한_클립보드_복사_자바스크립트.php

If you want JavaScript for copying in IE and Firefox, see the link above.



In Korea, not so many people know about IRI.
So I often intentionally include my Korean IRIs in other people's pages, forums, and boards. Then people who see my IRI realize: Ooh, Korean URLs are possible too...

SE/PR/linking... and so on.
I would tell you: don't worry :)
Why?

If we use IRI, we have more power than with non-IRI addresses or URIs.
For example, I write a Korean post. What is its URL? If I don't use IRI, its URL may take one of two forms. One is alogblog.com/my_good_posting.html. The other is alogblog.com/2006/234.html.

Yes, it's ONLY an address, not content or a title.

If we use an IRI address and Google or Yahoo doesn't understand the meaning of our IRI,
we have no additional loss. We just don't get an additional gain.

The good news is that Google understands IRIs properly, not as %HH forms.
(Korean) Yahoo seems to get the %HH form. Even so, we lose nothing.

But as time passes and we use our own languages in addresses more and more, Yahoo will also get IRIs with their original meaning. This is not technically difficult.


=======
As a conclusion, IRI is not a mature feature for now.
So for the time being we might experience some unknown problems, but one certain fact is that the Internet environment will move from URL (ASCII only; I hate that philosophy, although at the beginning of the Internet it was not a problem) to IRI.
Many browsers will support it more fully and properly, and more people will know and understand it.

It's a matter of time.

----------
Whew, this is too long.
I'm not good at English, so I'm afraid my thoughts may not reach you correctly.
I hope they do.

Bye

Hi Lee,

It's great to see this!
