Farsiweb: generate Farsi webpages
فارسی‌وب: برای تولید وب‌پیجهای فارسی



Farsiweb is a simple tool for generating web content in Farsi. Farsi, which is also known as Persian, is the language of Iran. It is written using the Arabic alphabet with some additional letters. Farsiweb works by reading a file containing ASCII transliterations of Persian content, and translating this to HTML-encoded Unicode suitable for display on web pages. Farsiweb can correctly handle nested levels of right-to-left and left-to-right text, for example, English words appearing inside Persian text and vice versa.
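
As a rough illustration of what "HTML-encoded Unicode" can look like, the sketch below replaces each non-ASCII character with a decimal numeric character reference. It is written in Python and is not taken from the Farsiweb source; the function name to_html_entities is hypothetical, and whether Farsiweb emits decimal references specifically is an assumption.

```python
def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character
    reference such as &#1575;, leaving ASCII characters untouched.

    Assumption: this mirrors the general idea of HTML-encoded Unicode only;
    it is not Farsiweb's actual encoder.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The Persian word for "horse" (alef, sin, be), written with escapes so the
# code points are unambiguous.
print(to_html_entities("\u0627\u0633\u0628"))  # prints &#1575;&#1587;&#1576;
```

Output of this kind is pure ASCII, so it survives any server or editor character-set configuration.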

Examples

Input:  <<esme man <<Peter>> ast.>>
Output: اسم من Peter است.

Input:  He said: "<<esme man <<Peter>> ast>>".
Output: He said: "اسم من Peter است".

Input:  <<man goftam "<<Horse is <<asb>> in Persian>>".>>
Output: من گفتم "Horse is اسب in Persian".
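
The nesting in these examples can be read as a depth counter: text at odd depth is transliterated, text at even depth (including depth zero) is copied as Latin text. The Python sketch below only splits the input into such segments. It is an illustration under that assumption, not Farsiweb's parser; the name split_by_mode is hypothetical, and HTML tags, [...] literals and \ escapes are deliberately ignored.

```python
def split_by_mode(source):
    """Split source into (mode, text) segments based on << and >> nesting.

    Odd nesting depth means transliteration mode, even depth means Latin.
    Simplification: HTML tags, [...] literals and backslash escapes are
    not handled here.
    """
    segments = []
    buf = []
    depth = 0
    i = 0

    def flush():
        # Emit the buffered text, tagged with the mode of the current depth.
        if buf:
            mode = "translit" if depth % 2 == 1 else "latin"
            segments.append((mode, "".join(buf)))
            buf.clear()

    while i < len(source):
        if source.startswith("<<", i):
            flush()
            depth += 1
            i += 2
        elif source.startswith(">>", i):
            flush()
            if depth == 0:
                raise ValueError("unbalanced >>")
            depth -= 1
            i += 2
        else:
            buf.append(source[i])
            i += 1

    flush()
    if depth != 0:
        raise ValueError("unclosed <<")
    return segments


for segment in split_by_mode("<<esme man <<Peter>> ast.>>"):
    print(segment)
# prints:
# ('translit', 'esme man ')
# ('latin', 'Peter')
# ('translit', ' ast.')
```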

Here are some webpages that use Farsiweb:

Transliteration codes

For a code table, run "farsiweb --codetable" or see the file codetable.html. Here are some of the special codes recognized by Farsiweb:
<< Begin transliteration, or begin nested Latin-encoded text
>> End transliteration, or end nested Latin-encoded text
<...> HTML tag. Enclosed text is not translated.
[...] Literal text. The enclosing brackets are omitted in the output, but the text between the brackets is copied literally. Useful for putting literal HTML escaped characters inside Persian text, e.g. [&nbsp;]. Only works in transliteration mode.
\ Escape. The following character is copied literally to the output. Only works in transliteration mode.
:: Separator. In transliteration mode, Farsiweb recognizes certain multiletter sequences as a single symbol, e.g. "kh" for the Persian letter "khe". If instead, two letters "k" and "h" are desired, they can be separated as "k::h". The symbol "::" has no other effect; in particular it does not generate a space or a word boundary.
# Zero-width space. This symbol represents a word boundary without generating any visible space. In particular, it prevents adjacent characters from joining. This is useful for words such as "banaabar#in" (بنابر‌این), which would otherwise become بنابرین. Another example is "khaaneh#haa" (خانه‌ها), which would otherwise become خانهها.
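
The role of the "::" separator is easiest to see in a longest-match tokenizer. The Python sketch below uses a hypothetical three-entry table: "kh" for khe comes from the description above, while the mappings for "k" and "h" are assumptions made for illustration and may differ from the real code table. It is not the Farsiweb implementation.

```python
# Hypothetical miniature code table (the real one is printed by
# "farsiweb --codetable"). Only "kh" is confirmed by the text above.
TABLE = {
    "kh": "\u062e",  # khe
    "k": "\u06a9",   # kaf (assumed mapping)
    "h": "\u0647",   # he (assumed mapping)
}


def transliterate(text):
    """Greedy longest-match transliteration with '::' as a silent separator."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("::", i):
            i += 2  # the separator produces no output at all
            continue
        for length in (2, 1):  # try the longer sequence first
            chunk = text[i:i + length]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # unknown character: copy unchanged
            i += 1
    return "".join(out)


print(transliterate("kh") == "\u062e")           # prints True
print(transliterate("k::h") == "\u06a9\u0647")   # prints True
```

Because "::" breaks the match but emits nothing, the two letters still sit next to each other in the output and can join normally.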

Additional notes:

HTML markup for bidirectional content

An HTML document with dominant right-to-left structure is created by giving <body dir="rtl">. It is also possible to add the parameter dir="rtl" (or dir="ltr" for embedded left-to-right content) to certain other tags, such as <p>, <pre> and so forth.

In some rare cases, the "Unicode bidirectional algorithm", which is responsible for the proper display of bidirectional text, fails to produce correct results. One such example is a comma-separated list of left-to-right items in right-to-left dominant text. For example, "the letters a, b and c", translated into Persian as "<<Horufe <<a>>, <<b>> va <<c>>>>" displays incorrectly as:

حروف a، b و c

This can be fixed with HTML "bidirectional overrides". To get the correct output, write "<<Horufe <bdo dir=rtl><<a>>, <<b>> va <<c>></bdo>>>":

حروف a، b و c

The reason for this "bug" is that commas are "neutral" characters in the Unicode bidirectional model, and they take on the directionality of whatever text surrounds them. Bidirectional overrides should be used very sparingly and only when absolutely necessary; they tend to mess up the correct display of numbers, which are always written left-to-right.
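
The character classes involved can be inspected with Python's standard unicodedata module, as in the sketch below (the helper name bidi_classes is hypothetical). Note that the Unicode standard formally files commas under CS ("common separator"), a weak type; when a comma is not between digits, the algorithm treats it like a neutral, which matches the behavior described above.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidirectional class) pairs for each character."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Latin "a", Arabic comma, space, Latin "b": the comma and the space have no
# strong direction of their own, so they follow the Latin letters around them.
for ch, cls in bidi_classes("a\u060c b"):
    print(f"U+{ord(ch):04X} {cls}")
# prints:
# U+0061 L
# U+060C CS
# U+0020 WS
# U+0062 L

# A Persian letter, by contrast, is strongly right-to-left.
print(unicodedata.bidirectional("\u062d"))  # prints AL
```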

Extensibility

It is relatively easy to add new symbols and transliterations to Farsiweb. For example, someone might wish to add support for other languages such as Arabic, or to add transliterations for less commonly used symbols that are not currently supported. To change or extend the code table, edit the source file codetable.c and recompile farsiweb.

Browser compliance

The output of Farsiweb can be viewed with any Unicode-compliant web browser. As of version 0.1 (June 2005), Microsoft Internet Explorer is more or less Unicode compliant, except that it does not display certain characters. Mozilla and Firefox are somewhat Unicode compliant, but under Linux they have bugs in the display of right-to-left scripts: they often fail to display the first character of each line, or display nothing until the text is highlighted. Firefox for Windows, however, seems to work fine. One can hope that Unicode compliance in web browsers will continue to improve.

Download

Installation

Follow the generic installation instructions in the file INSTALL.

Command line options

Usage: farsiweb [options]
Options:
 -h, --help                 -- print usage info and exit
 -v, --version              -- print version info and exit
 -c, --codetable            -- print table of transliteration codes
 -u, --unicodetable n0 n1   -- print HTML table of unicodes from n0 to n1-1
The input is read from "standard input", and the output is written to "standard output"; for example, this web page was generated by invoking farsiweb with the following command line:
farsiweb < index.src > index.html
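
Because farsiweb is a plain stdin-to-stdout filter, it can also be driven from a script. The Python sketch below shows one way to do that with the standard subprocess module. The helper name run_filter is hypothetical, and the commented usage assumes that a farsiweb executable is on the PATH and that a file index.src exists.

```python
import subprocess


def run_filter(command, source_text):
    """Pipe source_text through a stdin-to-stdout filter and return its output.

    command is an argument list such as ["farsiweb"]; a non-zero exit status
    raises subprocess.CalledProcessError.
    """
    result = subprocess.run(
        command,
        input=source_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


# Usage (assumes farsiweb is installed and index.src exists):
#
#     with open("index.src", encoding="utf-8") as src:
#         html = run_filter(["farsiweb"], src.read())
#     with open("index.html", "w", encoding="utf-8") as dst:
#         dst.write(html)
```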

Version

Version 0.1, June 12, 2005

Author

Peter Selinger

License

Copyright (C) 2003, 2005 Peter Selinger

This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License, as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.




Peter Selinger / Department of Mathematics and Statistics / University of Ottawa
selinger@mathstat.uottawa.ca / PGP key