Showing posts with label OCR. Show all posts

Sunday, April 30, 2023

Want to OCR Your Tibetan-script PDFs?

Mayday! Mayday! No idea what I’m doing here.

I’m not a techie. This is meant to help people like myself who are not techies either, but rather humans in the humanities who want to make the most of their computer’s innate or potential ability to search through Tibetan-script texts.


If you have a Google account already, it ought to be easy. Go to your account and choose "Google Drive." Upload your PDF by clicking on the "+ New" button. Once it is up there, you need to "right click" on it (if you have a Mac, search Google for "How to right click on a Mac"). Right clicking opens a small menu, from which you choose "Open with" and then "Google Docs." That does it! The recognized text shows up in a new Google Doc. Let us know how it works for you.


If you are in the mood to experiment some more, pay special attention to the message from Zach, and the links he supplies, at this Google discussion page called “Tesseract for Tibetan.”


If you don’t know what Tesseract is, well, you can Google it! That’s what I did.
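For anyone in the mood to poke at Tesseract directly, here is a minimal sketch in Python. It assumes Tesseract is installed along with its Tibetan language data (which goes by the code "bod"), and that you have already turned a PDF page into an image, since Tesseract reads images rather than PDFs. The function names and file names are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="bod"):
    """Return the command line that asks Tesseract to OCR one page image.

    "bod" is the code Tesseract uses for its Tibetan language data.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_page(image_path, output_base, lang="bod"):
    """Run Tesseract on one image; the text lands in <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract does not seem to be installed")
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)  # raises an error if Tesseract fails
    return output_base + ".txt"
```

Calling `ocr_page("page1.png", "page1")` should, if all goes well, leave the recognized text in a file called `page1.txt`.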


OpenPecha also has this very useful page:

https://medium.com/@OpenPecha/how-to-get-google-cloud-vision-to-ocr-tibetan-again-e810a1d402ce


Notice, too, that Tibetan has appeared in some of those translation applications. The one I've noticed and tested is Bing's:

https://www.bing.com/translator/

Just go there and see what happens. You may be surprised for better or for worse. Still, it’s worth a try.


If you have suggestions you think other humans can use, just drop them in the comment box. We’ll appreciate it. Artificial intelligences need not apply. You could say I am not a robot, or I am not a rabbit, although I am both and neither, or rather neither both nor neither...

In case you encounter a CAPTCHA* on your way to posting your comment you’ll know what to tell it.

(*CAPTCHA, or a “Completely Automated Public Turing Test to Tell Computers and Humans Apart.” I Googled it.)

One more word of advice: If you want to test out the OCRing abilities of Google Drive or whatever, make sure you start with a PDF made with a machine-readable Tibetan font. Do not try to use a scan of a woodblock print,* and by all means avoid cursive texts of all kinds.**

I’m just saying this because I’d like your experiment to be a pleasant and productive one. Otherwise you run the danger that even the Word of the Buddha could be reduced to what is, in our human colloquial, called “garbage.”

(*Save that particular experiment for later. **Unless, of course, they themselves were made with computer script, which is an unlikely possibility.)
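If you would rather have the computer give a first opinion on whether the result is “garbage,” here is a rough sketch in Python. It simply counts how much of the OCR output falls inside the Tibetan block of Unicode (U+0F00 to U+0FFF). The function names and the 0.8 cutoff are assumptions made up for illustration, and a high score only means the output is in Tibetan script, not that it was read correctly.

```python
def tibetan_share(text):
    """Fraction of non-space characters that lie in the Tibetan Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if "\u0f00" <= c <= "\u0fff")
    return tibetan / len(chars)


def looks_like_garbage(text, threshold=0.8):
    """Guess that the OCR went wrong if too little of the text is Tibetan script.

    The threshold is an arbitrary starting point, not a tested value.
    """
    return tibetan_share(text) < threshold
```

Paste a page of OCR output into `looks_like_garbage(...)`; if it answers `True` for a text you know to be entirely Tibetan, the experiment probably did not go well.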


———

btw

Soon all my blogposts will be A.I. generated. Since their “process” often results in cogent yet stupid statements, they require peer reviewers, so there will still be work for us humans, no worries! Why can’t they peer review themselves, you may ask? Because each one is the total peer of the other, which makes it impossible for them to judge one of their kind against another, or that’s how I understand it. Wow, I hadn’t thought of it this way before, but maybe that is an advantage they hold over us. For them equality is not just lip service.


Postscripts (October 2, 2023):

I just noticed this article from over a decade ago that ventures to solve the problems in machine recognition of Tibetan woodblock printed script.

I did try posting a blog using A.I., and you can check the results in the entry entitled The Land of Snows, along with The Seven Seals of Tibet: A Joycean Journey. You be the judge.

 