Can LO build a TOC from a PDF file?

Gilles

Can LO build a TOC from a PDF file?

Hello,

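This PDF file <https://www.legifrance.gouv.fr/download_code_pdf.do?cidTexte=LEGITEXT000006074228&dlType=pdf> has no table of contents, and I was wondering whether LO could grab all the headings and build a TOC.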

Thank you.

Jean-Francois Nifenecker

Re: Can LO build a TOC from a PDF file?

Hello Gilles,

On 09/07/2017 at 19:20, Gilles wrote:
> Hello,
>
> This PDF file
> <https://www.legifrance.gouv.fr/download_code_pdf.do?cidTexte=LEGITEXT000006074228&dlType=pdf>
> has no Table of Contents, and I was wondering if LO could grab all the
> headers and build a TOC.

In order to create a PDF with a TOC/index, you'll have to apply heading
styles to the appropriate paragraphs.

Opening a PDF with LibO won't get you anywhere, as the tool that opens it is
Draw, which can't apply paragraph styles the way a text processor does.

I can't see a way to do that quickly, I'm afraid: a copy/paste from the
PDF document to Writer is possible, but you'll have to fix a lot of
things (e.g. useless carriage returns) and apply heading styles by hand.
On a 400+ page document this is a big PITA.
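
To give an idea of the clean-up involved, here is a minimal Python sketch that re-joins hard-wrapped lines in text pasted from a PDF. It assumes paragraphs are separated by blank lines, which may not hold for every PDF; the helper name `unwrap_lines` is hypothetical, and heading styles would still have to be applied afterwards.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines into one line per paragraph.

    Assumption: a blank line marks a paragraph break; every other
    line break is treated as a useless carriage return.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # Drop empty fragments and collapse the remaining lines with spaces.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        if lines:
            cleaned.append(" ".join(lines))
    return "\n\n".join(cleaned)
```

The result can be pasted into Writer with one paragraph per line.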

Hopefully someone else will come up with brighter ideas.


Kind regards,
--
Jean-Francois Nifenecker, Bordeaux


--
To unsubscribe e-mail to: [hidden email]
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/users/
All messages sent to this list will be publicly archived and cannot be deleted

Cley Faye

Re: Can LO build a TOC from a PDF file?

2017-07-09 23:58 GMT+02:00 Jean-Francois Nifenecker <[hidden email]>:

> Hopefully someone else will come up with brighter ideas.

You want brighter ideas? Say no more!

So... hmm... I'm afraid there won't be many fully automated tools that can
build a TOC for you. A PDF basically contains a lot of individual elements
that are arranged to look like something coherent.
For the document you linked, it should theoretically be possible to write a
tool that splits out every page, grabs the raw text, uses a regex to find the
actual titles, builds a TOC, and injects it into the PDF. This would assume:
- Text extraction works correctly (that's not always the case with PDFs)
- Titles always follow the same format
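
The title-detection step can be sketched roughly as follows. This is a minimal Python illustration, not the script linked in this message. It assumes the text of each page has already been extracted into a list of strings, and that headings are whole lines starting with keywords such as Livre, Titre, Chapitre or Section followed by a number; the names `HEADING_LEVELS`, `build_toc` and `format_toc` are hypothetical.

```python
import re

# Assumed heading keywords and their nesting depth; adjust to the document.
HEADING_LEVELS = {"Livre": 1, "Titre": 2, "Chapitre": 3, "Section": 4}

# Assumed format: keyword, then a Roman or Arabic number (optionally "Ier"),
# then an optional ": title" on the same line.
HEADING_RE = re.compile(
    r"^(Livre|Titre|Chapitre|Section)\s+([IVXLC]+(?:er)?|\d+)\b\s*:?\s*(.*)$"
)


def build_toc(pages):
    """Return (level, heading text, 1-based page number) for each heading."""
    toc = []
    for page_number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            line = line.strip()
            match = HEADING_RE.match(line)
            if match:
                toc.append((HEADING_LEVELS[match.group(1)], line, page_number))
    return toc


def format_toc(toc):
    """Render the entries as indented plain text, one per line."""
    return "\n".join(
        f"{'  ' * (level - 1)}{title} ... {page}" for level, title, page in toc
    )
```

Injecting the entries back into the PDF as bookmarks is a separate step that needs a PDF library, and it only works if both assumptions above hold.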

But on this kind of document, you could definitely get some acceptable
results. I experimented a bit. The output is here:
http://www.cjoint.com/c/GGjw0OtPkGc
And for the curious, the "script" I used is here:
https://pastebin.com/icQSZxQr

As you'll see, it is VERY specific to this document, but it is possible to
do something.

Gordon Cooper

Re: Can LO build a TOC from a PDF file?

There is a roundabout way of doing this using Nuance's PDF Converter, but
I have not used it since I abandoned Windows® several years ago. With
PDF Converter, one can make a Word file that LO can read, then use LO's
Insert ToC tool and export the result back to PDF.

Gordon

Tauranga N.Z.


On 10/07/17 05:20, Gilles wrote:

> Hello,
>
> This PDF file
> <https://www.legifrance.gouv.fr/download_code_pdf.do?cidTexte=LEGITEXT000006074228&dlType=pdf>
> has no Table of Contents, and I was wondering if LO could grab all the
> headers and build a TOC.
>
> Thank you.


Gilles

Re: Can LO build a TOC from a PDF file?

Thanks very much, everyone. I naively thought it could be done simply by converting the PDF to text in LO and running a few regexes to build a TOC :-/