I came across Laurent Grégoire's CVS Quick Reference Card. I needed something quick and handy to put on my thumb drive, so I figured this would do. Only problem is that my Windows box only understood Adobe's PDF format. I've grown to really dislike PDF, primarily for the fact that the Adobe Acrobat Reader takes forever to come up and has become bloated. Since I have Adobe Acrobat Professional Version 8.0.0 installed here, I thought I'd see what formats I could convert the file into for doing some minor edits to the file

The CVS Quick Reference Card is a 3-columned, 2-page document consisting of 99% text, 2 horizontal lines, 2 vertical lines and a couple bullet symbols. Here is a screenshot of the top-left of the document:

Screenshot of a portion of the CVS Quick Reference Card pdf consisting of text laid out in columns

Looks nice. The PDF weighs in at 84kb. Not bad, really. But what I want to do is add a couple more items to the cheat sheet and send it back to Laurent (like cvs annotate for instance). This is not easy with a document in PDF format. These are the types of modifications that I'd like to do to PDFs every once in awhile and it's why I consider PDF more of a closed "publishing" format than a true collaborative document format. I can see the appeal of an electronic document that you know cannot be changed, but for my day-to-day use such documents are far and few between. Plus I just can't stand the Acrobat Reader, I'm sorry. 😉

Adobe Output

I was surprised to see a variety of formats listed in Acrobat Professional under Save As (though unsurprised that SVG was not present). I decided that since I recently started experimenting with GTDTiddlyWiki on my thumb drive for note-taking, that I would keep everything HTML-browser-centric and settled on the format "HTML 4.01 with CSS 1.0 (*.htm, *.html)". What's the worst that can happen?

Well, the first thing that happened was I got back some errors saying that some of the glyphs could not be converted. Ok, I can live with some glyph funkiness.

Then I brought up the resultant 249kb .htm file in Firefox and was dismayed to see it was not at all laid out in the nice 3-column layout of the PDF. This pretty much makes the HTML+CSS output from Adobe Acrobat unusable.

Then I looked at the source and was a little dumbfounded:

<BODY bgcolor=white text=black link=blue vlink=purple alink=fushia >

<P>

<SPAN style="color:#000000"

>CV</SPAN

><SPAN style="color:#000000"

>S </SPAN

><SPAN style="color:#000000"

>QUIC</SPAN

><SPAN style="color:#000000"

>K </SPAN

><SPAN style="color:#000000"

>REFERENC</SPAN

><SPAN style="color:#000000"

>E </SPAN

><SPAN style="color:#000000"

>CAR</SPAN

><SPAN style="color:#000000"

>D </SPAN

></P>

<P style="margin-bottom:0px; margin-left:0px; line-height:16px">

<SPAN style="font-family:'sans-serif', 'CMT I'; color:#000000"

>Overvie</SPAN

><SPAN style="font-family:'sans-serif', 'CMT I'; color:#000000"

>w </SPAN

></P>

Maybe someone understands the logic of how contiguous text is broken up into many <span> elements with the same style, but I certainly do not.

I tried Plain Text - not a very legible document (well, at least not the nice 3-column layout of course).

I tried PNG - good but each PNG page was 2339x1654 pixels requiring some manual scaling down for readability on my screen.

I tried JPG - good, but fuzzy if not fullscreen.

In fairness to Adobe, the RTF output was pretty legible and only 44kb.

SVG Output

So then I thought - I'm already half-into this, let's try to get a decent SVG out of it.

I was going to try Inkscape, but their PDF Import feature is still not there yet. Too bad.

I next tried the evaluation version of the PDF2SVG command-line tool. The SVG output was two pages, totalling 526kb uncompressed. Bringing it up in the best available desktop SVG viewer pretty much crippled the browser (extremely sluggish) and resulted in the following:

SVG output of CVS Quick Reference card as rendered in Opera 9.5 Beta showing poor quality of SVG output

Firefox and Safari were worse.

I then looked at the SVG source:

<g clip-path="url(#clp1)" transform="matrix(1 0 0 -1 0 595.276)">

<text transform="matrix(1 0 0 -1 0 0)"><tspan x="58.333,66.612,75.27" y="-556.97" class="ps00 ps20">CVS</tspan><tspan x="85.452,94.06,102.88,107.22" y="-556.97" class="ps00 ps20">QUIC</tspan><tspan x="115.49" y="-556.97" class="ps00 ps20">K</tspan><tspan x="128.29,136.88,144.41" y="-556.97" class="ps00 ps20">REF</tspan><tspan x="151.62,159.15,167.74,175.27,184.24" y="-556.97" class="ps00 ps20">ERENC</tspan><tspan x="192.5" y="-556.97" class="ps00 ps20">E</tspan><tspan x="203.85,212.13,220.79" y="-556.97" class="ps00 ps20">CAR</tspan><tspan x="229.39" y="-556.97" class="ps00 ps20">D</tspan></text>

<g id="xfrm1" transform="matrix(1 0 0 1 27.78 552.638)">

<g id="q2" class="ps00 ps20">

<path d="M0 0.199 L240.94 0.199" class="ps01 ps10"/>

</g>

<g id="xfrm3" transform="matrix(1 0 0 1 -27.78 -552.638)">

<text transform="matrix(1 0 0 -1 0 0)"><tspan x="27.78,35.422,40.005,44.588" y="-534.32" class="ps00 ps21">Over</tspan><tspan x="48.782,53.365,56.424,61.007" y="-534.32" class="ps00 ps21">view</tspan>

Next up was matterCast's SVG Imprint. The output clocked in at 334kb uncompressed. The SVG was similarly mangled with <tspan>s breaking up the text as above. Finally, they didn't produce valid SVG (the root <svg> node was missing the namespace declaration: xmlns="http://www.w3.org/2000/svg"). Once I fixed that, the files looked similar to the above, though the browser wasn't as sluggish.

I then found this service which converts submitted PDFs to SVGs and then sends you a link for free. It's really a promotion of its TextCafe conversion product which sounds nice in theory, especially the "Detection of text blocks and paragraphs, which can be reflowed automatically". I sent the PDF to them and didn't hear anything back for an hour. So I emailed them and got a response from Martin Hensel that it takes 6-12 hours. When I got it, it was a zip file containing an HTML harness and some SVG, JPG files. The SVG files have the same problems as other similar tools - basically a whole bunch of <tspan> elements instead of text blocks. The HTML harness is kind of a neat idea because it provide bookmarking/search pane similar to Acrobat Reader (provided via JS). However, the HTML harness insists that you have to install Adobe SVG Viewer. This might have been a good idea 2 years ago, but these days not only is ASV no longer supported by Adobe, but all but one browser supports enough SVG these days to be useful.

This isn't Opera or SVG's fault. I guess it's really just an algorithmic problem? The SVG output is similar in nature to the HTML output: contiguous text is mangled into subsequent spans/tspans with the same style applied. I'm really curious if anyone has a clue why this happens - is it that the conversion engines are trying to duplicate (down to the pixel) Adobe's kerning from the PDF source and fails so it just defaults to fragments of text? Is it that the PDF is "optimized" in such a way that it's not possible to determine what was a contiguous chunk of meaningful text? Looking at Texterity's metadata.js it seems that at least they can determine a list of indexable terms from the PDF...

I don't buy the idea that SVG is too verbose for something like this either. Zack Rusin's git Cheat Sheet has fancy flow charts as well as hunks of text and was produced by Inkscape (not known for the conciseness of their SVG, shall we say?). That files weighs in at only 161kb and that's not even compressed. And it looks great in every browser that supports SVG (ok, I saw one glitch in Firefox).

Costs

Adobe Acrobat Profressional Version 8 = $450 USD

matterCast's SVG Imprint = $199.95 USD (requires .NET 1.1)

PDFTron's PDF2SVG = $549 USD (without annual maintenance contract)

Texterity's FreeSVG = Free service if you're willing to submit your PDF to them

Texterity's TextCafe = Must obtain a quote

Conclusion

Like I said at the top, I can appreciate the need for a document format that you know will render pixel-perfectly the way you want. I can also appreciate that some authors want to make it difficult/impossible for other people to modify their documents. My gripes about PDF these days are mostly about using Acrobat Reader, but I think my experience this evening did indicate to me how difficult it is to convert from optimized PDF to some other format... it was an education. I think it's fair to say that conversion to PDF is more-or-less a one-way street. [Update: See François's comment below - he thinks it's the fault of the TeX->PDF conversion. Since I don't know enough about TeX and don't have the time to really dig into the source, I'll believe him.]

And yeah, I'm aware that my Google ads will probably be all for PDF converters. The universe is funny that way.

§418 · January 11, 2008 · Adobe, Software, Technology · · [Print]

Leave a Comment to “Adobitrocity”

  1. Wow, the PDF2SVG price tag is only per-CPU. They must have an amazing marketing team to compensate for the quality (okay, in this case) and the command-line-ness.

    I’d go for the tex source file. I opened the PDF in pdfedit (free software, in ubuntu’s universe), and the words are already broken up inside the PDF file. Make me wonder why they can’t/(don’t want to) make it less opaque.

  2. I mean opaque as in “open in text mode in emacs, and can’t find any string in it”. Sorry for the double comment.

  3. Okay, so you hate Acrobat. You hate bloated ? Look at the Windows OS – you must really hate that. Now, if there was no Acrobat, and we wanted to exchange digital version of the New York Times, what would you use (if Acrobat did not exist?) Okay, so, lets agree that PDF version 1 was probably all YOU needed for you little csv tip guide, but many people in many verticals need different things (ePublishing, engineering drawings, Tax forms with validation scripts) so, please, don’t bore us with you ‘but i want it to be tiny!’ – you want Tiny, use .txt – you want formated, use RTF, you want cross platform portability and formatted, use PDF – or, invent a new XML language like Adobe Mars – but you seem to want a file that you can also EDIT – lets see, you want cross platform, formated AND editability AND tiny – sheesh, what are you smoking (and can I get some!)

    Okay, now that I have had a moment to ‘comment’ – you already guessed what the issue is;

    – is it that the conversion engines are trying to duplicate (down to the pixel) Adobe’s kerning from the PDF source and fails so it just defaults to fragments of text?”

    Yes, and a quick peek at the ‘history’ of file conversion and you note this is no new issue. PostScript was a programing language – it contained all sorts of device control stuff, so it was big and platform dependent. If you wanted to ‘edit’ a PostScript file, you needed to parce it and then add all sorts of code (making it even larger) – So – PDF is invented, and in the case of Adobe Illustrator, you can even save the editing bits ‘inside’ the PDF – so it can be easily parced. Think of it like creating a special kind of tooth pasted with swirls that once squessed out of the tube, was designed to put back into the tube and retain the swirls – i think you get the level of complexity here.

    So, Acrobat (the application) is not editor (well, okay, you can add Enfocus PitStop to modify objects, text) – but you can send PDF objects externally to other editors (to modify vector based objects and text – Illustrator – or images – Photoshop) – but now you have some pretty heavy applications !

    You ‘complaint’ is more to do with the editing part – one can make a pretty tiny PDF, but if you want retain and preserve all the little bits that let an application edit it, well, the PDF then becomes something else – instead of tiny portable, it becomes larger editable.

    So, you you seem to want is both, and sorry, this is not a file format issue, it is a ‘what Adobe has discovered most people want in a PDF – portability, editability and functionality (like when you have a form with a validation script)

    I nearly laughed out loud when you said “PNG” – hey, that is the total opposite of PDF – try exchanging text to be editied using THAT.

    Finally, this comment was the most confusing to me — “These are the types of modifications that I’d like to do to PDFs every once in awhile and it’s why I consider PDF more of a closed “publishing” format than a true collaborative document format.”

    So, i read this over and over, and you start by saying “I want file that retains formating and allows me to edit it” – and then – you seem to imply that “but PDF, i can’t figure out how to open the PDF as code and edit it in some text editor , so I feel like it is ‘closed’ — humm – so, either PDF is ‘closed’ because you can’t figure out how to parce it and edit it (hint!), or (perhaps) PDF is “open” if some developer other than Adobe offers you some free tool that lets you edit it ?

    Your type kills me!

    1. Okay, so, PDF is open – Adobe has published the specification and it is available for free. Adobe has never ever required anyone to ‘license’ the PDF specification, and has allowed – even encouraged – any and all developers to build tools that create, parce, convert, edit and generate PDF entirely unrestricted of licensing or fees.

    Evidence ? well, for one…

    http://www.bcltechnologies.com/

    2. PDF (Version 1.7 of the PDF Specification) is now an ISO standard (ISO 32000 Standard (DIS) – so, not sure what more can be said about “open”.

    http://blogs.adobe.com/insidepdf/2007/12/iso_ballot_for_pdf_17_passed.html

    3. Hey, if you hate PDF, and don’t mind losing cross platform – you know, where you create an entire ‘open’ ecosystem based on Microsoft Vista only, feel free to chug the koolaid and jump from PDF to XPS. You sound like you might love XPS. In fact, I double dog dare you to convert XPS into all the file formats you tried using something other that products from Adobe. Have fun with that, talk to you in like 400 years or so from now.

    Me, my first love was SGML and 1 bit TIFF *sigh* – but you really need a server technology to make it ‘portable’ and ‘editable’ so, I converted my religion, so, yep, you guessed it, i proclaim a jihad and you complaints of Acrobat !

    Wow, that was fun (for me anyway) good luck with your search for a tiny well formated and editable file that offers a free editor and ‘swiss army knife-like” suite of conversion!

    I sure it is hidden somewhere out there on the net!

  4. @PDF Boy: I think that’s the single largest comment I’ve ever had on a blog post, thanks! 🙂

    As for some of your points:

    cross platform, formated AND editability AND tiny

    I know – why do I hear a certain Rolling Stones tune playing in the background…

    Now, if there was no Acrobat, and we wanted to exchange digital version of the New York Times, what would you use (if Acrobat did not exist?)

    HTML+CSS? SVG? Hey, I’m being serious – why do I hear laughter? 😉

    Hey, if you hate PDF, and don’t mind losing cross platform – you know, where you create an entire ‘open’ ecosystem based on Microsoft Vista only, feel free to chug the koolaid and jump from PDF to XPS. You sound like you might love XPS.

    I thought it was pretty clear in my blog post that I consider formats like HTML or SVG to be a good end goal since they are cross-platform and no one vendor controls those specs. But if it wasn’t clear, at least it is now 😉

    You ‘complaint’ is more to do with the editing part – one can make a pretty tiny PDF, but if you want retain and preserve all the little bits that let an application edit it, well, the PDF then becomes something else – instead of tiny portable, it becomes larger editable.

    You’re right – because an ideal document (to me) means something you can 1) read, 2) search, and 3) edit/refine. Since the ‘D’ in PDF stands for “document”, that’s what I feel PDF should be. Maybe that’s just my own personal definition. If you’re saying that to make the PDF editable it requires lots more “stuff” in the doc file, that’s fine and perfectly valid. Perhaps that’s the difference between ‘optimized PDF’ and ‘non-optimized PDF’ ?

    So, i read this over and over, and you start by saying “I want file that retains formating and allows me to edit it” – and then – you seem to imply that “but PDF, i can’t figure out how to open the PDF as code and edit it in some text editor , so I feel like it is ‘closed’ — humm – so, either PDF is ‘closed’ because you can’t figure out how to parce it and edit it (hint!), or (perhaps) PDF is “open” if some developer other than Adobe offers you some free tool that lets you edit it ?

    No, I never complained that I couldn’t open up a PDF in a text editor (you can’t open up a compressed SVG in a text editor either). But you’re right, the terms “open” and “closed” are often used in different and confusing ways. In the sense that I meant it, it was the ability to edit the document or convert it into some other document format that I could edit it. These editors can be “heavy” applications (from what I’ve seen of the conversion tools, they’d have to be), I just don’t know of them or have them (Illustrator is one, I guess?). Ideally I would also like the ability to do this with a free application.

  5. Jason says:

    For what it’s worth, I’ve been using Foxit PDF reader. It’s faster that Adobe and lets you do some annotations and save the document.

    http://www.foxitsoftware.com/pdf/rd_intro.php

  6. Phil says:

    I never really understood why people dislike Adobe Reader. I started with 6, and each version seems to get better. But I never really wanted to edit a pdf. Once I had to mark one up. I printed it, marked it, then scaned it. Now I use the windows journal printer driver and highlight it in windows journal. I can use PDFCreator (another printer driver) to convert it back. I don’t know what the relative file sizes would be though.

  7. Rob says:

    lets see, you want cross platform, formated AND editability AND tiny – sheesh, what are you smoking (and can I get some!)

    To be fair, I don’t think tiny was a requirement. SVG does meet all those requirements, just because the conversion software can’t output clean SVG doesn’t mean the same info couldn’t be produced in SVG in a reasonable size.

    I also take issue with the idea that publishing a specification makes an open standard. There’s a critical point that an open standard is one that’s not written or driven solely by one company. I don’t know about the case of PDF but I do know that though the Flash specification is published, it’s still controlled solely by Adobe. Publishing the specification while controlling what tools that output the document format is not how an open standard works. Again, I know that’s the case for Flash and I know it’s the same company but if you say it’s an ISO standard then maybe somebody else has a say in what goes in to the next version.

    Finally, since I’m on a roll here, nobody invited Microsoft to this party. Just because they make a more messed up document format doesn’t make Adobe’s format better. PDF does a certain job, it fills a certain need, but I’d be much happier if I didn’t need a PDF reader and the headaches that come with it.

  8. Leonard Rosenthol says:

    Another approach you could have taken to get that columnar content out of PDF using Acrobat was to select the text as a table (hold down the alt key) and then right-click to bring up the context menu with “Save as Table” or “Open Table in Spreadsheet”. From there, you can easily edit the content in any program that can handle .csv and then output to the format(s) of your choice.

  9. Why contiguous text is broken up into many elements?

    I think it’s not the tools’ fault, it is the way the PDF was built. I think it came from a LaTeX file.

    And that’s the point: LaTeX tries to make a nice output. So, instead of making one text block per line, or even per word, it moves some characters right or left.

    I observed this behavior in a PostScript file generated from LaTeX a long time ago…

  10. Oh, that was painful to watch, especially when you know how costly these spurious elements are. In a DOM-based system the processing cost increases with the number of elements, linearly at best. This doesn’t usually hurt for HTML, where you usually don’t do so much processing, but it does for the generally more demanding SVG. You will feel the pain in HTML as well when you watch your clever DOM script slows to a halt on a seemingly simple page.

    There are two sometimes conflicting ideals older than both HTML and PDF, visual fidelity (or WYSIWYG) and media independence (or document reuse). PDF is tightly associated with the first ideal. If you want someone else to see exactly what you see shipping a PDF is brilliant. If you want to process it afterwards it is less than ideal.

    HTML started in the other camp. While it wasn’t born an SGML application, SGML was always its fairy godfather, and it is brilliant as a document description language. The WYSIWYG ideal came much later, with the implicit ideal in CSS 2.1 that the visual presentation should be predictable, and given the same conditions interoperable (CSS1 and CSS2 layouts were in principle predictable, but the focus wasn’t on interoperability). I would claim that there is no better document description language than (X)HTML+CSS, and that includes an old favourite DocBook. SVG’s role is somewhere in the middle.

    Handwritten HTML+CSS can be extremely terse, as can handwritten SVG, but editing tools have a harder time auguring the author’s intent, and tend to be defensive, lazy, and redundant. Good code only adds elements when they mean something, but a code optimiser will have problems with ‘span’-filled documents. Even if it is possible to reduce the code to produce the same visual representation the optimiser won’t know if other processors (e.g. a script) need those elements. The end result is that code only becomes more bloated with each step, and only rarely if at all less bloated.

    What I would wish for editors, converters and similar document processors is that their code is external and not added inline in the document itself, much like CSS uses external style sheets (and nice scripts do the same). It should keep the document markup minimal, while the rest can be as messy as needed.

  11. @Leonard: Thanks for the tips here. I think (as François) that the root cause of my problems are the source TeX->PDF conversion to begin with. Though I still think that PDF->xxx converters should be able to be more intelligent.

    @François: Thanks very much for chiming in here. Last night I had a similar thought. Seems that this neither related to PDF or the PDF converters, it’s more an issue with the source file (TeX) being converted into PDF. Nonetheless, a smart converter should be able to tell that the sample HTML code could (should!) be merged into one <span> or <p> element.

    @Jonny: Thanks for your perspective on this. My experience with DocBook and TeX are next to zero, myself…