[Progress Communities] [Progress OpenEdge ABL] Forum Post: RE: Read a PDF File using Progress

  • Thread starter Jean-Christophe Cardot
Status
Not open for further replies.
Jean-Christophe Cardot

Guest
David, I would nuance your sentence about PDF reading. It is not only about string handling; if you treat it that way, you will be able to read only a subset of PDF files, depending on which tool was used to create them. There might be binary characters embedded in the strings, in particular CHR(0), especially when using UTF-16, which prevents normal ABL/4GL string handling from working. The strings can also be encoded following various schemes, so once you have extracted a string, you still have to decode it.

For this reason and others, I have completely rewritten the PDF reader part of pdfInclude, which can now handle any PDF file. ABLPDF, being based on pdfInclude v3, has not had this rewrite (last time I checked), and as such cannot read every PDF file, only specific ones that are internally formatted the way ABLPDF expects. pdfInclude no longer has this limitation and can read any PDF file, but the "strings" we are speaking about above are the metadata of the file (author, title, subject, keywords, various dates, strings defining hyperlinks, document outline/summary, etc.), not the contents themselves.

Reading the text content of a page is a completely different matter, much more difficult to achieve (and not implemented in pdfInclude yet). In the very basic case, the page content is encoded as ASCII and uncompressed, in which case it is easy to read. But this is not usually the case. First you have compression, easily defeated using zlib. Then you obtain all the PDF operators and arguments, including font selection, colours, graphical elements, drawings, etc. (you can even have embedded pictures here!), out of which you have to extract only the ones which output text. Not an easy task.

Then you have the text in the page. The text? Not really, because it is in fact composed of font glyphs, i.e. not characters, but their graphical representation. There is no reason a glyph would have the same code as a character, even in the Unicode space.
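To make the "compression, then operators" step concrete, here is a minimal sketch in Python (not ABL, and not how pdfInclude does it). It assumes a content stream stored with the FlateDecode filter, and it only looks for literal strings followed by the `Tj` operator. The sample stream and the names `inflate_stream` and `shown_strings` are made up for illustration; a real reader would also need to handle hex strings, `TJ` arrays, nested parentheses and other filters.

```python
import re
import zlib

# Hypothetical page content: a small hand-written content stream mixing
# graphics operators with text operators, compressed with zlib the way a
# FlateDecode stream would be stored in the file.
raw_content = (
    b"q 1 0 0 1 0 0 cm 0.5 g 10 10 100 50 re f Q\n"
    b"BT /F1 12 Tf 72 700 Td (Hello) Tj 0 -14 Td (World) Tj ET\n"
)
compressed = zlib.compress(raw_content)


def inflate_stream(data):
    """Undo FlateDecode compression, which is plain zlib/deflate."""
    return zlib.decompress(data)


# A literal string operand "( ... )" followed by the Tj (show text) operator.
# Simplification: backslash escapes are skipped over, but unescaped nested
# parentheses, hex strings <...> and TJ arrays are NOT handled.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(content):
    """Return the raw byte strings passed to Tj, in file order."""
    return [match.group(1) for match in SHOW_TEXT.finditer(content)]


content = inflate_stream(compressed)
strings = shown_strings(content)
print(strings)
```

Note that what comes out are the bytes handed to the font, i.e. glyph codes. They only look like readable text here because the sample uses a simple encoding.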
Also, the PDF file you are reading may have placed each character or word so that it looks nice on the page, but the words or even individual characters can appear in any order in the PDF file itself, provided they are preceded by text placement operators which put them at the right place on the page. So in order to really obtain the text, you would have to extract not only the text operators, but also the matrices and text placement operators, and figure out where on the page each given word/character is going. And this cannot be a pure algorithm. Heuristics will have to be implemented (e.g. when do you consider that 2 characters are on the same line? What if one character is only one point (roughly 1/72 inch) below the other? etc.).

So now you have the glyph codes, positioned on a text file. Add to this the font subsetting system of PDF (a PDF file can embed a TTF file containing only the characters which have been used, in order to save space): you then have to open and parse the embedded TTF file in order to get the correspondence between glyph codes and character codes. This is quite a complex task to do in ABL, very far away from string handling; I know it very well, having developed the font subsetting functionality in the latest versions of pdfInclude. Then yes, you have the text.

This is a very complex task, and the heuristics I mentioned above (along with the lack of need) are the main reason it has not been implemented in pdfInclude, by the way.

Regards,
JC
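As an illustration of the kind of heuristic meant above, here is a small Python sketch (again not ABL, and not something pdfInclude implements). It assumes the placement operators have already been resolved into fragments of the form (x, y, text) in points, with y growing upward. The function name `group_into_lines`, the 2-point tolerance and the sample fragments are all arbitrary choices made for this example.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text.

    Heuristic: walking from the top of the page down, a fragment joins the
    current line if its baseline is within y_tolerance points of the first
    fragment of that line; otherwise it starts a new line. Inside a line,
    fragments are ordered left to right.
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f[1]):  # top to bottom
        y = frag[1]
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([y, [frag]])
    # Joining with a single space is itself an assumption; a real reader
    # would look at the horizontal gaps and glyph widths.
    return [
        " ".join(text for _, _, text in sorted(frags, key=lambda f: f[0]))
        for _, frags in lines
    ]


# Fragments in the arbitrary order they might appear in the file.
fragments = [
    (150.0, 699.0, "world"),
    (72.0, 686.0, "second"),
    (72.0, 700.0, "Hello"),
    (120.0, 686.5, "line"),
]
print(group_into_lines(fragments))
```

With the tolerance at 2 points, "world" (one point below "Hello") lands on the same line. Set the tolerance to 0.5 and it becomes a line of its own, which is exactly why this cannot be a pure algorithm.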
