Python Arabic Text Reshaper

Today I was trying to generate PDF reports with Geraldo Reports, and the reports needed to contain Arabic text. Arabic script has two essential features:

  1. It is written from right to left.
  2. The characters change shape according to their surrounding characters.

So when you try to print Arabic text in an application – or a library – that doesn’t support Arabic you’re pretty likely to end up with something that looks like this:

We have two problems here. First, the characters are in their isolated forms, which means that every character is rendered regardless of its surroundings. Second, the text is written from left to right.

To solve the latter issue all we have to do is to use the Unicode bidirectional algorithm, which is implemented purely in Python in python-bidi. If you use it you’ll end up with something that looks like this:

The only issue left to solve is to reshape those characters and replace them with their correct shapes according to their surroundings.
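
To make "reshaping" concrete: Unicode encodes the contextual shapes of each Arabic letter as separate code points in the Arabic Presentation Forms-B block, and a reshaper swaps each base letter for one of them. The sketch below only looks those forms up with Python's standard `unicodedata` module; the helper name `presentation_forms` is made up for illustration and is not part of the reshaper.

```python
import unicodedata

def presentation_forms(letter):
    """Map a form name ('isolated', 'final', ...) to its presentation-form character."""
    base = "%04X" % ord(letter)
    forms = {}
    # Arabic Presentation Forms-B occupies U+FE70..U+FEFF.
    for code_point in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(code_point)).split()
        # Single-letter forms decompose as e.g. "<initial> 0628";
        # ligatures and unassigned code points are skipped.
        if len(parts) == 2 and parts[1] == base:
            forms[parts[0].strip("<>")] = chr(code_point)
    return forms

beh = presentation_forms("\u0628")   # ARABIC LETTER BEH, four forms
alef = presentation_forms("\u0627")  # ARABIC LETTER ALEF, only two forms
for name, char in sorted(beh.items()):
    print(name, "U+%04X" % ord(char))
```

Choosing which of these forms to use for each letter, based on its neighbours, is the part the reshaper takes care of.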

I solved this issue more than four years ago in a small application that I wrote in Visual Basic. My solution was naive, but it worked well. A few days ago I faced the same problem, rendering Arabic text correctly, but this time on Android. I searched and used the solution in this SO answer, which is pretty similar to the solution provided in Better Arabic Reshaper.

Today I ported the solution in Better Arabic Reshaper from Java to Python, tweaked it a little bit, and used it to successfully render Arabic text in PDF, and the result was:

Pretty cool, right? Here is another test with English text and some diacritics in it:

It looks fine! In Word the same text looks like this:

Amazing, now it is time for you to use the ported library along with python-bidi to solve those issues.

Usage
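
A minimal sketch of the intended usage, assuming the script is importable as `arabic_reshaper` and python-bidi is installed. The wrapper name `prepare_for_rendering` and its optional parameters are hypothetical, added only so that the two steps can be tried with stand-in functions; `pass_arabic_text_to_render` is a placeholder for whatever rendering call you use.

```python
def prepare_for_rendering(text, reshape=None, get_display=None):
    """Reshape the letters first, then reorder the text for display."""
    if reshape is None:
        import arabic_reshaper            # this script
        reshape = arabic_reshaper.reshape
    if get_display is None:
        import bidi.algorithm             # python-bidi
        get_display = bidi.algorithm.get_display
    reshaped_text = reshape(text)         # pick the contextual letter forms
    return get_display(reshaped_text)     # apply the bidirectional algorithm

# Typical call, with both libraries available:
#   bidi_text = prepare_for_rendering(u'اللغة العربية رائعة')
#   pass_arabic_text_to_render(bidi_text)  # placeholder: your PDF or image drawing call
```

The order matters: the text is reshaped while it is still in logical order, and only then reordered for display.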

Demo

You can try an online demo of this script on my Python/Django site here: Arabic Reshaper Online.

Download

The source code is licensed under the GNU General Public License (GPL).

Project on GitHub
Source code download from GitHub

Have fun واستمتع! :)

55 thoughts on “Python Arabic Text Reshaper”

  1. May God bless you, brother Abdullah.
    I hope this helps me with the many things that do not support Arabic.
    I hope we can get in touch by email.

  2. Hello Abd,
    Thank you for this *extremely* valuable port. Quick question, regarding “single letters”.
    Your algorithm reshapes an isolated letter, such as ض (\u0636) into a shaped one : ﺿ (\uFEBF).
    I don’t think this is correct. I am considering adding a line at the very top of the function “get_reshaped_word” to exclude one-letter words. Would that make sense?

    def get_reshaped_word(unshaped_word):
        if len(unshaped_word) == 1: return unshaped_word  ### <-- New
        unshaped_word = replace_lam_alef(unshaped_word)
        decomposed_word = DecomposedWord(unshaped_word)

  3. Thanks for this project. I just wanted to let you know that my problem with the error:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

    was solved by putting the following lines in arabic_reshaper.py:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

  4. First, thanks for sharing this code, but I have a problem with the example that you provided.

    pass_arabic_text_to_render(bidi_text)
    NameError: name 'pass_arabic_text_to_render' is not defined

    1. Welcome,
      Your error occurs because this is not actually a method. The line is just a placeholder: instead of it, you should call your own rendering method, which accepts the Arabic text and renders it. That could be PDF printing, a PIL image, or anything else.

      Cheers.

  5. Assalam Alykum

    Thank you, brother, for your great effort and for sharing it. Now I can finally use beautiful Arabic fonts on Linux for OpenERP Arabic reports.
    arabic_reshaper.py was suggested as part of the solution for OpenERP Arabic reports in https://github.com/barsi/openerp-rtl

    I have noticed a vertical alignment problem when generating the reports: the data is not well aligned vertically. Is this issue related to the reshaper or to ReportLab’s representation of the Arabic font?

    Note that before I used the solution in the link [ https://github.com/barsi/openerp-rtl ], some fonts were well aligned but had the square-box issue. Now the boxes are gone, but the text is not well aligned vertically!

    1. Wa Alaikom Al Salaam,

      Thanks, Razan, for using this solution. I think the problem you’re having is due to the font you’re using: I’ve used multiple fonts with Arabic text and Python without this vertical alignment problem. You should experiment with several fonts until you find the best one for you. I tried Arial and Helvetica; try them if you want.

      Good luck…

      1. Thanks, I’ve tried various fonts, even Arial, but I still have the same problem. I have now found that the alignment for reports in the ReportLab engine is handled in the paragraph.py file, and that is where the problem comes from. I am now trying some tricks.

        thanks again

  6. Hello,

    I’m using your module together with bidi, and the Arabic text itself is clearly correct and well wrapped, whether in the console or in a text editor. However, I need to render Arabic text properly as a Paragraph entity in ReportLab, and I’m facing a problem with word wrap only (RTL text is wrapped, but the new line appears above, not below). How did you get past this?

    best regards and thanks for your effort
    Marek

    1. Hi Marek,

      Can you show me an example of what is going on? The only problem I know of when dealing with paragraphs is that text that needs to be wrapped gets messed up, so you need to break it into lines before you reshape it.
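
The line-breaking advice above can be sketched like this: wrap the text while it is still in logical order, then reshape and reorder each line separately. This is only an illustration. `textwrap` counts characters rather than measuring glyph widths, the function name `wrap_then_shape` and the `width` value are assumptions, and the `reshape` and `get_display` callables are passed in (normally `arabic_reshaper.reshape` and `bidi.algorithm.get_display`).

```python
import textwrap

def wrap_then_shape(text, width, reshape, get_display):
    """Break logical-order text into lines, then shape each line on its own."""
    lines = textwrap.wrap(text, width=width)
    return [get_display(reshape(line)) for line in lines]

# Typical call, with both libraries available:
#   import arabic_reshaper
#   from bidi.algorithm import get_display
#   lines = wrap_then_shape(arabic_text, 60, arabic_reshaper.reshape, get_display)
```

Each returned line is then drawn as its own right-aligned line, first line at the top, so the renderer never has to wrap already-reordered text.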

      1. Hi Abd Allah,

        thanks for the response. Let’s say I have this Arabic snippet:
        إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة
        In English this should mean something like: “If we take into account the nature of climate variability and inter-annual variability and those on long-term addition to the lack of accuracy of measurements and calculations used….”

        Now I want to render it as Reportlab PDF doc:

        arabic_text = u'إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة'
        arabic_text = arabic_reshaper.reshape(arabic_text) # join characters
        arabic_text = get_display(arabic_text) # change orientation by using bidi

        pdf_file = open('disclaimer.pdf', 'wb') # PDF output is binary
        pdf_doc = SimpleDocTemplate(pdf_file, pagesize=A4)
        pdfmetrics.registerFont(TTFont('Arabic-normal', '../fonts/KacstOne.ttf'))
        style = ParagraphStyle(name='Normal', fontName='Arabic-normal', fontSize=12, leading=12. * 1.2)
        style.alignment = TA_RIGHT
        pdf_doc.build([Paragraph(arabic_text, style)])
        pdf_file.close()

        The result is here https://www.dropbox.com/s/gdyt6930jlad8id/disclaimer.pdf. You can see the text itself is correct and readable (at least for Google Translate :-)), but not wrapped as expected for RTL script.

        best regards
        Marek

    1. Wa Alaikom Al Salaam, this is only a display problem within the page at the link. If you copy the text and paste it into an application, you will find that the problem is not there.

      Thank you.

  7. Thank you for this extremely valuable port, which helped generate printed registration rolls for over a million voters in Libya.

    There is a minor bug with the lam-alef glyphs, which appears to be from the original Java package, as I have noted in GitHub issue #2.

    We have also mirrored the RTL branch of reportlab to GitHub, in case others would like to use it without installing mercurial.
    https://github.com/hnec-vr/reportlab-rtl

    1. Thank you Josh for your reply and your bug report, I fixed it in GitHub.

      Would you be able to send me the case study for your project that you used this script in? I would love to see how people are using it :) My email is mpcabd {( AT )} G Mail [DOT] COM

      All the best :)

    2. Hi Josh & Abd Allah,

      I am still confused about how to break a block of Arabic text into lines in a ReportLab paragraph. Starting from the right side of the page, the text should run to the left margin and continue on a new line below, starting again from the right. This is not what happens when I run the code against the reportlab-rtl branch. In the PDF I got this:
      ‫و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة‬
      ‫إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ‬
      instead of this:
      إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة

      This is the complete code (using reportlab-rtl, python-bidi and Abd Allah’s reshaper):

      #encoding:UTF-8
      from reportlab.lib.pagesizes import A4
      from reportlab.platypus.doctemplate import SimpleDocTemplate
      import arabic_reshaper # Abd Allah’s code
      from bidi.algorithm import get_display # python_bidi
      from reportlab.pdfbase import pdfmetrics
      from reportlab.pdfbase.ttfonts import TTFont
      from reportlab.lib.styles import ParagraphStyle
      from reportlab.lib.enums import TA_RIGHT
      from reportlab.platypus.para import Paragraph

      pdf_file=open('disclaimer_arabic.pdf','w')
      pdf_doc = SimpleDocTemplate(pdf_file, pagesize=A4)
      arabic_text = u'إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة'
      arabic_text = arabic_reshaper.reshape(arabic_text) # join characters
      arabic_text = get_display(arabic_text) # change orientation by using bidi
      #english_text = 'If we take into account the nature of climate variability and inter-annual variability and those on long-term addition to the lack of accuracy of measurements and calculations used'
      pdfmetrics.registerFont(TTFont('Arabic-normal', 'KacstOne.ttf'))
      style = ParagraphStyle(name='Normal', fontName='Arabic-normal', fontSize=12, leading=12. * 1.2)
      style.alignment=TA_RIGHT
      pdf_doc.build([Paragraph(arabic_text, style)])
      pdf_file.close()

      best
      Marek

        1. I have tried it already (it looks very promising :-)), but unfortunately it has no effect, at least with my code…

  8. Hi Josh & Abd Allah,
    I was trying the reportlab-rtl branch with the reshaper and bidi. ReportLab’s Paragraph doesn’t seem to be RTL-enabled, because a block of Arabic text is not broken into lines properly. Text running from the right side is expected to continue on a new line below, starting again from the right. Instead, the new line appears above. Is this feature missing in the ReportLab Paragraph class for RTL text? It works for LTR.
    all the best

  9. Salam,

    I have rewritten your library in the Haxe language so that I can port it to PHP, JavaScript, C#, C++, and Java, but I didn’t rewrite the method get_display.
    The question is, why should I use get_display to reverse the text? I can simply reverse it by iterating through the letters in a loop, right?

    Also, I have tried it, but I got this result: , so why does the ALEF look like a LAM? Please see here:
    https://drive.google.com/file/d/0BwzBTCo1-KJBSHA4c25GRXZzNkE/edit?usp=sharing

    1. Salam Samir,

      You should use get_display to reverse the text. Yes, I assume you could simply iterate through the text in reverse, but I think that get_display does more than just that. See here.

      And as for your little problem with the Aleph, it’s certainly showing as the end-form Aleph (U+FE8E) which I think is happening because of a mistake copying/porting the code, check your code that corresponds to this line and this line.

      Good luck.

      1. Thanks, brother! I’ve solved the problem; it was a brackets issue. But I’ve noticed that when I use the Shadda, I get dotted circles.
        I tried the text:

        ‘التّرجمة الفوريّة’

        Which has shadda in each word, but here is the result:
        https://drive.google.com/file/d/0BwzBTCo1-KJBbElZejRMY3JtRTg/edit?usp=sharing

        Any advice? Maybe I have to use specific fonts that have full Unicode support for Arabic? If that is the problem, what fonts do you suggest?

        Thank you!

    1. This empty space appears because the Shadda is a non-spacing character. My recommendation is to strip the diacritics (حركات) from the text before you reverse it, because they won’t work properly after the text is reversed.
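
One way to strip the diacritics, sketched with Python's standard `unicodedata` module: drop every nonspacing mark (Unicode category `Mn`), which covers the Shadda and the short-vowel marks. The function name `strip_diacritics` is made up for this example, and note that it removes all combining marks, not only Arabic ones.

```python
import unicodedata

def strip_diacritics(text):
    """Remove nonspacing combining marks such as the Shadda (U+0651)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# The first word of the sample above, written with escapes so the Shadda is visible:
# alef, lam, teh, SHADDA, reh, jeem, meem, teh marbuta
sample = "\u0627\u0644\u062A\u0651\u0631\u062C\u0645\u0629"
stripped = strip_diacritics(sample)
print(len(sample), len(stripped))
```

The stripped text can then be passed to the reshaper and to get_display as usual.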

          1. Dear brother, I think you could simply add more Unicode cases to cover all the Arabic letters with diacritics on them. I think there is a Unicode code point for each case, but I am not sure…

            1. Then there must be a way to “merge” the Shadda glyph with the previous letter’s glyph. Notice that the Shadda glyph is empty at the bottom; if you place the Shadda over the previous letter, you get the correct result. But how would we do that?!

            2. Sure, that would be possible, but it’s a rendering issue, which is out of the scope of this script. As you can see, the script doesn’t render the text; it just reshapes it so that you can pass the reshaped text to whatever renders it.

  10. First, thanks for the good effort,
    but I did not understand why you need the python-bidi library.
    You can do without it by adding this code
    RTL = ""
    for letter in reshaped_text:
        RTL = letter + RTL

    at the end.

    And this is the complete program:

    # -*- coding: utf-8 -*-

    import arabic_reshaper
    reshaped_text = arabic_reshaper.reshape(u'اللغة العربية رائعة')
    RTL = ""
    for letter in reshaped_text:
        RTL = letter + RTL
    print RTL

    Thank you

    1. In fact, you cannot drop the python-bidi library and replace it with your approach, because your approach will fail when the text contains numbers or non-Arabic characters.

      Thanks for the comment :)
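
A small sketch of why plain reversal is not enough: reversing every character also reverses the digits of a number, whereas numbers must keep their left-to-right order. The second function patches only that one case and is not a substitute for the bidirectional algorithm; both function names are made up for this example.

```python
import re

def naive_reverse(text):
    """Reverse every character, as in the loop above."""
    return text[::-1]

def reverse_keeping_digits(text):
    """Reverse the text, then restore the order of each run of ASCII digits."""
    return re.sub(r"[0-9]+", lambda m: m.group(0)[::-1], text[::-1])

sample = "\u0639\u0627\u0645 2024"  # "عام 2024" ("year 2024") in logical order
print(naive_reverse(sample)[:4])           # 4202
print(reverse_keeping_digits(sample)[:4])  # 2024
```

Latin words, punctuation, and brackets need similar care, which is the work that get_display does.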

  11. Salam o Alykum.
    Newbies cannot understand how to use this script. I have the same issues with OpenERP reports and also with Arabic fonts (right/left side).
    It would be great if someone could explain in a bit more detail how to use this script from scratch on Ubuntu 14.04.
    thanks

    1. Hi Zubair,

      Please note that this is not a pip package, you have to download the script from github, and put it in your PYTHONPATH or next to your script. You will also need to install python-bidi which can be easily done through pip install python-bidi. After that you will be able to call arabic_reshaper.reshape on your text and then pass it to bidi.algorithm.get_display to make it ready to be printed or passed to some other library that will handle the rendering of the text.

      Regards.

  12. Dear Abd,

    Thanks for the reshaper. I do not know Arabic. I was testing the arabic reshaper. I would like to get some feedback if possible.

    My understanding is that the proper way to write yeh followed by teh would be “یت”. My question is: what should the Arabic reshaper produce once it is applied to یﺖ?

    1. Hi Kursat,

      The letter ى (Alef Maksura) is used only in its isolated and final forms, so no character should come after it. Check its Unicode forms here: http://www.fileformat.info/info/unicode/char/search.htm?q=ARABIC+LETTER+ALEF+MAKSURA&preview=entity

      The only problem is that in Egypt this letter is used as a replacement for the letter ي (Yeh) in the final form. This is not proper Arabic; it belongs only to Egyptian colloquial usage, which is, IMO unfortunately, used so much on the internet that it even has its own Wikipedia language edition (http://arz.wikipedia.org/).

      So to summarize, the reshaper should show these characters separated.

      An example would be trying to reshape “ذهبت إلىمنزلي” (I went home) which has a mistake of not separating إلى from منزلي, and it will reshape to , while the rendering engine on my Chrome shows it like which is not proper Arabic.

      I hope this answers your question.

  13. Dear Abd,

    Thank you for your answer. I am not very good with git or GitHub.

    So, I would like to point a line of code where YEH and ALEF MAKSURA gets mixed up in your code in ARABIC_GLYPHS:

    u'\u06CC' : [u'\u06CC', u'\uFEEF', u'\uFEF3', u'\uFEF4', u'\uFEF0', 4]

    FEEF and FEF0 are codes for ALEF MAKSURA.

    Instead this line must be:

    u'\u06CC' : [u'\u06CC', u'\uFEF1', u'\uFEF3', u'\uFEF4', u'\uFEF2', 4]

    Thanks,

    k.

    1. Hi Aker,

      Sorry for the late response, I was a bit busy.

      The Alef Maksura character used is U+0649, and its only two forms used are U+FEEF (isolated) and U+FEF0 (final). The Yeh character is U+064A, and its four forms used are U+FEF1 (isolated), U+FEF3 (initial), U+FEF4 (medial), and U+FEF2 (final). The character you’re referring to, U+06CC, is neither the Arabic Yeh nor the Arabic Alef Maksura. It is the Farsi Yeh, which has four forms: U+FEEF (isolated), same as Alef Maksura; U+FEF3 (initial), same as Yeh; U+FEF4 (medial), same as Yeh; and U+FEF0 (final), same as Alef Maksura.
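
The names of the code points mentioned here can be checked with Python's standard `unicodedata` module. The snippet is only a lookup aid and says nothing about how the reshaper's own table is built.

```python
import unicodedata

code_points = [0x0649, 0x064A, 0x06CC,           # base letters
               0xFEEF, 0xFEF0,                   # Alef Maksura forms
               0xFEF1, 0xFEF2, 0xFEF3, 0xFEF4]   # Yeh forms
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}
for cp in code_points:
    print("U+%04X  %s" % (cp, names[cp]))
```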

      I hope this clears up the confusion.

      Regards.

  14. Dear Abdullah,

    I’m a newbie at using ReportLab, and I’m trying to use your reshaper code to create an Urdu document. My full code is as follows:

    from reportlab.pdfgen import canvas
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont

    import arabic_reshaper
    from bidi.algorithm import get_display

    pdfmetrics.registerFont(TTFont('Urdu','JameelNooriNastaleeq.ttf')) # Urdu Nastaleeq font

    c = canvas.Canvas(filename = 'test2.pdf',pagesize='A4')

    x = 250
    y = 500

    text = u'عدنان الحسن'
    reshaped_text = arabic_reshaper.reshape(text)
    bidi_text = get_display(reshaped_text)

    c.setFont('Urdu',30)
    c.drawString(x,y,bidi_text)

    c.showPage()
    c.save()

    But, unfortunately, I’m unable to get anything in the pdf file generated by above code. The output from my code can be seen on following link:
    https://www.dropbox.com/s/pa252v1858xldn2/test2.pdf?dl=0

    I would be grateful if you could suggest a solution to this problem.

    Thanks in advance!
    Adnan

        1. Okay, I managed to repeat the process you showed in the video, but still, I wasn’t able to get anything in pdf file. It’s an empty pdf file.
