Farsiweb: generate Farsi webpages
فارسیوب: برای تولید وبپیجهای فارسی
Farsiweb is a simple tool for generating web content in Farsi. Farsi,
which is also known as Persian, is the language of Iran. It is written
using the Arabic alphabet with some additional letters. Farsiweb works
by reading a file containing ASCII transliterations of Persian content,
and translating this to HTML-encoded Unicode suitable for display on
web pages. Farsiweb can correctly handle nested levels of
right-to-left and left-to-right text, for example, English words
appearing inside Persian text and vice versa.
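As a rough illustration of what "HTML-encoded Unicode" can look like, the Python sketch below replaces every non-ASCII character with a decimal character reference. It assumes that Farsiweb's output takes this form; the exact form is not stated here. The function name html_encode is made up for this example.

```python
def html_encode(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference such as &#1575;."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The three letters alef, sin, mim (U+0627, U+0633, U+0645).
print(html_encode("\u0627\u0633\u0645"))  # &#1575;&#1587;&#1605;
print(html_encode("Peter"))               # Peter
```

ASCII text passes through unchanged, so markup such as <p> survives as-is. The sketch does not escape "&", "<" or ">".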
Examples
Input | Output
<<esme man <<Peter>> ast.>> | اسم من Peter است.
He said: "<<esme man <<Peter>> ast>>". | He said: "اسم من Peter است".
<<man goftam "<<Horse is <<asb>> in Persian>>".>> | من گفتم "Horse is اسب in Persian".
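The nesting in these examples can be read as a depth counter: text at an odd nesting depth is transliterated, and text at an even depth (including depth 0) is copied as Latin text. The following Python sketch shows that reading. It is not Farsiweb's actual parser; it ignores HTML tags, [...] literals and backslash escapes, and the name split_by_mode is made up for this example.

```python
def split_by_mode(src: str) -> list[tuple[str, str]]:
    """Split input into (mode, text) segments, where mode is 'latin' or 'translit'."""
    segments = []
    buf = []
    depth = 0  # number of currently open "<<" markers

    def flush():
        # Odd depth means we are inside transliterated text.
        if buf:
            mode = "translit" if depth % 2 == 1 else "latin"
            segments.append((mode, "".join(buf)))
            buf.clear()

    i = 0
    while i < len(src):
        pair = src[i:i + 2]
        if pair == "<<":
            flush()
            depth += 1
            i += 2
        elif pair == ">>":
            if depth == 0:
                raise ValueError("'>>' without matching '<<'")
            flush()
            depth -= 1
            i += 2
        else:
            buf.append(src[i])
            i += 1
    if depth != 0:
        raise ValueError("unclosed '<<'")
    flush()
    return segments


for mode, text in split_by_mode('He said: "<<esme man <<Peter>> ast>>".'):
    print(mode, repr(text))
```

For the second example above, this yields the segments 'He said: "' (latin), 'esme man ' (translit), 'Peter' (latin), ' ast' (translit) and '".' (latin).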
Transliteration codes
For a code table, run "farsiweb --codetable" or see the file
codetable.html.
Here are some of the special codes recognized by Farsiweb:
<<    | Begin transliteration, or begin nested Latin-encoded text.
>>    | End transliteration, or end nested Latin-encoded text.
<...> | HTML tag. Enclosed text is not translated.
[...] | Literal text. The enclosing brackets are omitted in the output,
      | but the text between the brackets is copied literally. Useful for
      | putting literal HTML escaped characters inside Persian text,
      | e.g. [&nbsp;]. Only works in transliteration mode.
\     | Escape. The following character is copied literally to the
      | output. Only works in transliteration mode.
::    | Separator. In transliteration mode, Farsiweb recognizes certain
      | multiletter sequences as a single symbol, e.g. "kh" for the Persian
      | letter "khe". If instead, two letters "k" and "h" are desired, they
      | can be separated as "k::h". The symbol "::" has no other effect; in
      | particular it does not generate a space or a word boundary.
#     | Zero-width space. This symbol represents a word boundary without
      | generating any visible space. In particular, it prevents adjacent
      | characters from joining. This is useful for words such as
      | "banaabar#in" (بنابراین), which would otherwise become
      | بنابرین. Another example is "khaaneh#haa" (خانه‌ها),
      | which would otherwise become خانهها.
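One plausible way to recognize multiletter symbols such as "kh" is longest-match-first tokenization, with "::" consumed as a separator that produces no token. The Python sketch below illustrates this idea only. The symbol list is a small made-up subset, the matching strategy is an assumption about how such a tool might work rather than a description of Farsiweb's source, and backslash escapes and [...] literals are not handled.

```python
# Hypothetical subset of transliteration symbols; the real list is in the code table.
SYMBOLS = ["kh", "aa", "k", "h", "n", "e", "a"]


def tokenize(text: str) -> list[str]:
    """Split transliterated text into symbols, trying longer symbols first."""
    longest_first = sorted(SYMBOLS, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        if text.startswith("::", i):
            i += 2  # separator: consumed, produces no token
            continue
        match = next((s for s in longest_first if text.startswith(s, i)), None)
        if match is None:
            match = text[i]  # "#" and unknown characters become one-character tokens
        tokens.append(match)
        i += len(match)
    return tokens


print(tokenize("khaaneh"))      # ['kh', 'aa', 'n', 'e', 'h']
print(tokenize("k::haaneh"))    # ['k', 'h', 'aa', 'n', 'e', 'h']
print(tokenize("khaaneh#haa"))  # ['kh', 'aa', 'n', 'e', 'h', '#', 'h', 'aa']
```

In this sketch "#" survives as its own token, so a later stage could turn it into the invisible word boundary, while "::" only splits "k" from "h" and leaves no trace in the output.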
Additional notes:
- As you can see from the code table,
there are usually several different transliterations available for a
single Persian character. For example, the Persian character "و"
(vaav) can be written as "v", "u", "O", or "oo". The idea is that it
should be written the way it is pronounced.
- Some symbols have a special meaning when they occur at the
beginning of a word. For instance, the symbol "aa" or "A" normally
generates alef ("ا"), but at the beginning of a word it generates
alef with maddeh ("آ"). Similarly, the symbols "i", "u", and "O"
(but not "y" and "v") generate an additional alef when they occur at
the beginning of a word, for instance, "iraan" is ایران and not
یران, and "Oqaat" is اوقات and not وقات. Finally, the
symbols "a", "e", and "o" normally generate no output at all, except
at the beginning of a word, when they generate an alef "ا". These
rules have been designed so that the transliteration can be written
more or less phonetically. For each symbol that has a special form at
the beginning of a word, this form is shown in the pink column of the
code table.
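The word-initial rules described above can be sketched as a table that stores two forms per symbol, a normal form and a form used at the beginning of a word. The Python example below is a minimal illustration under stated assumptions. The table is a hypothetical subset built only from the examples in this section, the consonant mappings are inferred from "iraan" and "Oqaat", and the function name transliterate_word is made up.

```python
ALEF, ALEF_MADDEH = "\u0627", "\u0622"
YE, VAV = "\u06cc", "\u0648"

# symbol -> (normal form, word-initial form); hypothetical subset of the code table
CODES = {
    "aa": (ALEF, ALEF_MADDEH), "A": (ALEF, ALEF_MADDEH),
    "i": (YE, ALEF + YE), "u": (VAV, ALEF + VAV), "O": (VAV, ALEF + VAV),
    "y": (YE, YE), "v": (VAV, VAV),          # no extra alef for "y" and "v"
    "a": ("", ALEF), "e": ("", ALEF), "o": ("", ALEF),
    "r": ("\u0631",) * 2, "n": ("\u0646",) * 2,
    "q": ("\u0642",) * 2, "t": ("\u062a",) * 2,
}


def transliterate_word(word: str) -> str:
    """Transliterate one word, using the word-initial form for the first symbol."""
    symbols = sorted(CODES, key=len, reverse=True)  # try "aa" before "a"
    out = []
    i = 0
    while i < len(word):
        sym = next((s for s in symbols if word.startswith(s, i)), None)
        if sym is None:
            raise ValueError(f"no code for {word[i]!r}")
        normal, initial = CODES[sym]
        out.append(initial if i == 0 else normal)
        i += len(sym)
    return "".join(out)


print(transliterate_word("iraan"))
print(transliterate_word("Oqaat"))
```

With this table, "iraan" begins with alef followed by ye, and "Oqaat" begins with alef followed by vaav, matching the two examples given above.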
HTML markup for bidirectional content
An HTML document with dominant right-to-left structure is created by
using <body dir="rtl">. It is also possible to add the attribute
dir="rtl" (or dir="ltr" for embedded left-to-right content) to certain
other tags, such as <p>, <pre> and so forth.
In some rare cases, the "Unicode bidirectional algorithm", which is
responsible for the proper display of bidirectional text, fails to
produce correct results. One such example is a comma-separated list of
left-to-right items in right-to-left dominant text. For example, "the
letters a, b and c", translated into Persian as
"<<Horufe <<a>>, <<b>> va <<c>>>>", displays incorrectly as:
حروف a، b و c
This can be fixed with HTML "bidirectional overrides". To get the
correct output, write
"<<Horufe <bdo dir=rtl><<a>>, <<b>> va <<c>></bdo>>>":
حروف a، b و c
The reason for this "bug" is that commas are "neutral" characters in
the Unicode bidirectional model, and they take on the directionality
of whatever text surrounds them. Bidirectional overrides should be
used very sparingly and only when absolutely necessary; they tend to
mess up the correct display of numbers, which are always written
left-to-right.
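The directional classes involved can be inspected with Python's standard unicodedata module. The sketch below is only an aid for exploring the behavior described above. The helper name bidi_classes is made up, and the results depend on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Return (character, bidirectional class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Latin letter, ASCII comma, Arabic comma, Arabic letter hah, ASCII digit.
for ch, cls in bidi_classes("a,\u060c\u062d1"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cls}")
```

Latin letters report the strong class "L" and Arabic-script letters the strong class "AL", while the ASCII comma reports "CS" (common separator). In the Unicode bidirectional algorithm, a separator that is not between digits is resolved from its surroundings, which matches the informal description of commas as "neutral".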
Extensibility
It is relatively easy to add new symbols and transliterations to
Farsiweb. For example, someone might wish to add support for other
languages such as Arabic, or to add transliterations for less commonly
used symbols that are not currently supported. To change or extend the
code table, edit the source file codetable.c and recompile farsiweb.
Browser compliance
The output of Farsiweb can be viewed with any Unicode-compliant web
browser. Currently, Microsoft Internet Explorer is more or less
Unicode compliant, except that it does not display certain
characters. Mozilla and Firefox are somewhat Unicode compliant, but
under Linux there are some bugs in the display of right-to-left
scripts; they often fail to display the first character of each line,
or mysteriously display nothing until the text is highlighted.
However, Firefox for Windows seems to work fine. One can hope that
Unicode compliance in web browsers will continue to improve in the
future.
Download
Installation
Follow the generic installation instructions in the file INSTALL.
Command line options
Usage: farsiweb [options]
Options:
 -h, --help                  -- print usage info and exit
 -v, --version               -- print version info and exit
 -c, --codetable             -- print table of transliteration codes
 -u, --unicodetable n0 n1    -- print HTML table of unicodes from n0 to n1-1
The input is read from "standard input", and the output is written to
"standard output"; for example, this web page was generated by
invoking farsiweb with the following command line:
farsiweb < index.src > index.html
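The same redirection can be scripted, for instance when several pages need to be regenerated. The Python sketch below wraps the command line shown above. The helper name run_filter and its command parameter are made up for this example, and it assumes a farsiweb executable is on the PATH.

```python
import subprocess


def run_filter(src_path, out_path, command=("farsiweb",)):
    """Run a stdin-to-stdout filter, reading src_path and writing out_path."""
    with open(src_path, "rb") as src, open(out_path, "wb") as out:
        # Equivalent to: farsiweb < src_path > out_path
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(list(command), stdin=src, stdout=out, check=True)


# Example call (requires farsiweb to be installed):
# run_filter("index.src", "index.html")
```

Files are opened in binary mode so that the bytes written by the program are stored unchanged.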
Version
Version 0.1, June 12, 2005
Author
Peter Selinger
LICENSE
Copyright (C) 2003, 2005 Peter Selinger
This program is free software; you can redistribute it and/or modify
it under the terms of version 2 of the GNU General Public License, as
published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307,
USA.
Peter Selinger / Department of Mathematics and Statistics / University of Ottawa
selinger@mathstat.uottawa.ca