Sunday, April 30, 2023

Want to OCR Your Tibetan-script PDFs?

Mayday! Mayday! No idea what I’m doing here.

I’m not a techie. This is meant to help people like myself, humans in the humanities who want to make the most of their computer’s innate or potential ability to search through Tibetan-script texts.


If you have a Google account already, it ought to be easy. Go to your account and choose "Google Drive." Upload your PDF by clicking the "+ New" button. Once it is up there, right-click on it (if you have a Mac, search Google for "How to right click on a Mac"). Right-clicking opens a small menu, from which you choose "Open with" and then "Google Docs." That does it! Let us know how it works for you.


If you are in the mood to experiment some more, pay special attention to the message from Zach, and the links he supplies, at this Google discussion page called “Tesseract for Tibetan.”


If you don’t know what Tesseract is, well, you can Google it! That’s what I did.


OpenPecha also has this very useful page:

https://medium.com/@OpenPecha/how-to-get-google-cloud-vision-to-ocr-tibetan-again-e810a1d402ce


Notice, too, that Tibetan has appeared in some of the translation applications. The one I've noticed and tested is Bing Translator:

https://www.bing.com/translator/

Just go there and see what happens. You may be surprised for better or for worse. Still, it’s worth a try.


If you have suggestions you think other humans can use, just drop them in the comment box. We’ll appreciate it. Artificial intelligences need not apply. You could say I am not a robot, or I am not a rabbit, although I am both and neither, or rather neither both nor neither...

In case you encounter a CAPTCHA* on your way to posting your comment you’ll know what to tell it.

(*CAPTCHA, or a “Completely Automated Public Turing Test to Tell Computers and Humans Apart.” I Googled it.)

One more word of advice: If you want to test out the OCRing abilities of Google Drive or whatever, make sure you start with a PDF made with a machine-readable Tibetan font. Do not try to use a scan of a woodblock print,* and by all means avoid cursive texts of all kinds.**

I’m just saying this because I’d like your experiment to be a pleasant and productive one. Otherwise you run the danger that even the Word of the Buddha could be reduced to what is, in our human colloquial, called “garbage.”

(*Save that particular experiment for later. **Unless, of course, they themselves were made with computer script, which is an unlikely possibility.)
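For readers who do not mind a little Python, here is a rough sketch of one way to tell quickly whether an OCR result has come out as “garbage.” It simply measures what share of the characters fall in the Unicode Tibetan block (U+0F00 to U+0FFF). The function names and the 0.8 threshold are assumptions made up for this illustration. They are not part of Google Drive, Tesseract, or any other OCR tool.

```python
def tibetan_ratio(text: str) -> float:
    """Return the share of non-whitespace characters in the Unicode Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # The Tibetan block runs from U+0F00 to U+0FFF.
    tibetan = sum(1 for c in chars if "\u0f00" <= c <= "\u0fff")
    return tibetan / len(chars)


def looks_like_garbage(text: str, threshold: float = 0.8) -> bool:
    """Flag text whose Tibetan share is below the threshold.

    The default of 0.8 is an arbitrary assumption; adjust it for texts
    that legitimately mix in Latin letters or digits.
    """
    return tibetan_ratio(text) < threshold


if __name__ == "__main__":
    # Paste the text you copied out of the Google Doc (or other tool) here.
    ocr_output = "བཀྲ་ཤིས་བདེ་ལེགས།"
    print(tibetan_ratio(ocr_output), looks_like_garbage(ocr_output))
```

A high share only shows that the output is in Tibetan script, not that it was read correctly, so this catches the worst failures and nothing more. Proofreading against the original is still a job for humans.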


———

btw

Soon all my blogposts will be A.I. generated. Since their “process” often results in cogent yet stupid statements, they require peer reviewers, so there will still be work for us humans, no worries! Why can’t they peer review themselves, you may ask? Because each one is the total peer of the other, which makes it impossible for them to judge one of their kind against another, or that’s how I understand it. Wow, I hadn’t thought of it this way before, but maybe that is an advantage they hold over us. For them equality is not just lip service.


Postscripts (October 2, 2023):

I just noticed this article from over a decade ago that ventures to solve the problems in machine recognition of Tibetan woodblock printed script.

I did try posting a blog using A.I., and you can check the results in the entry entitled “The Land of Snows,” along with “The Seven Seals of Tibet: A Joycean Journey.” You be the judge.

6 comments:

  1. Hi rTen, thanks for that! It's really helpful indeed. One thing though: you cannot OCRize a whole volume in a single click. You can only do so for a couple of pages at a time. But it's a great improvement over previous methods.

  2. Has anyone tried these OCR converters?

    https://onlineocrconverter.com/free-ocr-in-tibetan
    https://www.i2ocr.com/pdf-ocr-tibetan

  3. Yes, I have tried them. They both give the same good results with newer publications typed on a computer. The first one does not accept PDFs, just images (png, etc.), one page at a time. The second one does indeed do PDFtoOCR and takes multiple page documents. However, when OCRizing a manuscript, the results are not usable, and I haven't even tried with a good dbu-med mss (like that of Ratna Lingpa's collected termas for instance where the copyist is really gifted).

  4. I stumbled across your post after trying i2ocr and having mixed results. It requires you to do one page at a time. The recognition was fair. But it routinely left out lines on a page (when trying to OCR a text in machine font, it would leave out one full line of text). My idea was to make a rough OCR of a text I was wondering about, then quickly outline the sa-bcad. Good plan! i2ocr had a knack for leaving out the sa-bcad lines (gnyis pa ni...), which was a bit maddening. I'll follow your links and see what Zach has to say!

  5. I just noticed quite a simple and intelligible (in the beginning) page on OCRing Tibetan at digitaltibetan.com. Have a look: https://digitaltibetan.github.io/DigitalTibetan/docs/tibetan_ocr.html

  6. I just tried to OCR a Tibetan PDF using i2ocr and the quality seemed somewhere between good and not good enough. Then I tried it with Google Drive; I have yet to go through it line by line, but it seems at least as good as i2ocr, with the added benefit of catching that headers should be different font sizes and highlighting potential misreadings. I'll go through it line by line soon and may post an update.


Please write what you think. But please think about what you write. What's not accepted here? No ads, no links to ads, no back-links to commercial pages, no libel against 3rd parties. These comments won't go up, so no need to even try. What's accepted? Everything else, even 1st- & 2nd-person libel, if you think they have it coming.