    BobLd
    @BobLd
    yes so it's worth creating an issue in the repo
    don't forget to put an example document where you have the issue
    JPUlisses
    @JPUlisses
    I will, thank you. And what about the sections? Is there an easy way to extract them?
    BobLd
    @BobLd
    depending on how the pdf document was created, you can have a look at bookmarks
    JPUlisses
    @JPUlisses
    bookmarks sounds good, how do I get those or indexes?
    BobLd
    @BobLd
    let me check
    var hasBookmarks = document.TryGetBookmarks(out Bookmarks bookmarks);
    from what I remember you can access them by pages
    some bookmarks contain information about where the text is on the page, but that's not often the case
    JPUlisses
    @JPUlisses
    if (document.TryGetBookmarks(out Bookmarks bookmarks))
    {
        Debug.Log($"Document contained bookmarks with {bookmarks.Roots.Count} root nodes.");
    }
    is not returning anything for me
    BobLd
    @BobLd
    it means your document does not have bookmarks
    JPUlisses
    @JPUlisses
    ah I was hoping bookmarks were automatically generated on PDF creation
    BobLd
    @BobLd
    ahah.. with pdf, never expect anything to be automatic
    if you are creating the PDF yourself then, depending on what software you use, you can generate them. Word has this option
    JPUlisses
    @JPUlisses
    I am afraid I have to go with text paragraph + some sort of small entity recognition for generic section names on papers.
    BobLd
    @BobLd
    if you want to go this way, I'd recommend machine learning... that's how difficult this task is
    I did some tests with YOLOv3 here https://github.com/BobLd/YOLOv3MLNet
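
For comparison with the machine learning route, the "generic section names" idea mentioned above can be sketched as a naive baseline. This is only an illustration: the `SectionHeadingMatcher` class, its name list, and its numbering rules are assumptions made for this sketch, not part of PdfPig and not something either participant posted. It checks whether a single extracted line of text looks like a common paper heading.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not a PdfPig type.
public static class SectionHeadingMatcher
{
    // Optional numbering ("1.", "2.3", "II.") followed by one generic section name.
    // The list of names is an assumption and will miss many real headings.
    private static readonly Regex HeadingPattern = new Regex(
        @"^\s*(?:(?:\d+(?:\.\d+)*|[IVXLC]+)[.)]?\s+)?" +
        @"(abstract|introduction|background|related work|methods?|methodology|" +
        @"results|discussion|conclusions?|references)\s*:?\s*$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns true and the lower-cased section name when the line looks like a heading.
    public static bool TryMatch(string line, out string sectionName)
    {
        sectionName = null;
        if (string.IsNullOrWhiteSpace(line))
        {
            return false;
        }

        Match match = HeadingPattern.Match(line);
        if (!match.Success)
        {
            return false;
        }

        sectionName = match.Groups[1].Value.ToLowerInvariant();
        return true;
    }
}
```

A matcher like this only sees text, so it cannot use font size or position, which is one reason the task is hard enough to justify a trained model.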
    JPUlisses
    @JPUlisses
    I must say I am having fun experimenting with PdfPig. I wish I had found this earlier, as I am doing a PhD that could use this
    BobLd
    @BobLd
    very interesting! if you want, I've created a repo with some resources here https://github.com/BobLd/DocumentLayoutAnalysis
    maybe have a look here if you want to test pre-trained models for this task: https://github.com/BobLd/DocumentLayoutAnalysis#pre-trained-models
    JPUlisses
    @JPUlisses
    thank you so much for sharing
    BobLd
    @BobLd
    what's your PhD about?
    JPUlisses
    @JPUlisses
    it is about using these tools to create visualizations or reports that help researchers skim papers faster, or improve information retention or readability through summaries.
    I created some tools using CERMINE and Python. They run and I can build them, but I have many issues, first with IKVM and then with using command lines to run the Java and Python. Although I made it work, I have difficulties sharing it due to the CERMINE copyleft license.
    so now I found PdfPig, which is a perfect fit for a C# scenario and has no GPL 3 license, so I can share builds on the cloud for the data visualizations
    BobLd
    @BobLd
    very interesting! I started contributing to PdfPig with a similar personal project in mind. have a look at UglyToad.PdfPig.DocumentLayoutAnalysis, there are some tools that could be useful to you
    JPUlisses
    @JPUlisses
    originally I wanted to do VR exploration of papers
    but I think that adds time for people that want to skim multiple papers faster
    BobLd
    @BobLd
    haha okay, yes
    JPUlisses
    @JPUlisses
    I can see some of the tools you've shown being very useful!
    BobLd
    @BobLd
    and if you want a very very basic classifier, look here https://github.com/BobLd/PdfPigMLNetBlockClassifier
    JPUlisses
    @JPUlisses

    I found another problem.
    So I am using Unity, and with the nuget package I can get PDFPig to run well. However when I build for WebGL I get the error
    "InvalidOperationException: Could not find AFM resource with name: UglyToad.PdfPig.Fonts.Resources.AdobeFontMetrics.Courier-Bold.afm."

    Since I did not have the source code, I couldn't find the reason and thought it was because WebGL was not loading the file properly.

    I put the PdfPig source files into Unity without errors and, as far as I can tell, it works well, but now I get the same error in the Editor.

    I don't know why this does not happen in the editor with the nuget version. However as I am running in the editor I got a better look at the error.

    UglyToad.PdfPig.Fonts.Standard14Fonts.Standard14.AddAdobeFontMetrics (System.String fontName, System.String afmName, System.Nullable`1[T] type) (at Assets/Scripts/PDFPig/UglyToad.PdfPig.Fonts/Standard14Fonts/Standard14.cs:120)
    Rethrow as InvalidOperationException: Could not load Courier-Bold from the AFM files.

    The file is at \UglyToad.PdfPig.Fonts\Resources\AdobeFontMetrics\Courier-Bold.afm

    I can provide the unity project if needed. Anyone have a clue?

    BobLd
    @BobLd
    I released https://github.com/BobLd/PublayNet-maskrcnn-mlnet, a machine learning model for page segmentation / document layout analysis, if it's of interest to anyone
    marounb98
    @marounb98
    Hello, can we extract the paragraphs directly from the PdfDocument (metadata...)? I have tried using regex but it is not very effective. Thanks
    BobLd
    @BobLd
    Hi @marounb98, can you give more details about what you are looking to achieve? Which paragraphs and what metadata?
    @EliotJones - I've done some cleaning in the issue and created a new tag for document layout analysis related issues
    MAYUR JANSARI
    @mayurjansari
    Is there any way to get the tokens and update them? I want to add a rendertoken to an existing PDF.
    rjnfrazao
    @rjnfrazao
    Hi all, I am using PdfPig in a .NET Core project. I open the PDF from a MemoryStream fine and the PdfDocument object contains 16 pages, great. However, when I get one Page, its Text property is empty (""). The tool was working, but this issue started when I migrated the documents to Azure File Storage. Additional information: the document in Azure File Storage is perfect; when I download it, I get the correct 16-page PDF file. Any suggestions on how I can troubleshoot the PdfDocument object instead of the Page object?
    rjnfrazao
    @rjnfrazao
    I believe my problem is in the way I download the file from Azure File Share ... ShareFileDownloadInfo download = await fileClient.DownloadAsync(); return download.Content;
    rjnfrazao
    @rjnfrazao
    Weird. It's working today. I don't know why. Page.Text is filled in with the correct text of the page.
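
The cause of the empty text was never identified in this thread. As a general troubleshooting sketch for the question above, one way to rule out the download step is to buffer the downloaded stream fully into memory and check that the bytes start with the `%PDF-` header before handing them to the parser. The `PdfStreamDiagnostics` class below is a hypothetical helper written for illustration; it uses only the .NET base class library.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of PdfPig or the Azure SDK.
public static class PdfStreamDiagnostics
{
    // Copies the whole stream into a byte array, rewinding first when the stream is seekable.
    public static byte[] ReadFully(Stream source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        if (source.CanSeek)
        {
            source.Position = 0;
        }

        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Returns true when the data begins with the "%PDF-" file header.
    public static bool StartsWithPdfHeader(byte[] data)
    {
        byte[] header = Encoding.ASCII.GetBytes("%PDF-");
        if (data == null || data.Length < header.Length)
        {
            return false;
        }

        for (int i = 0; i < header.Length; i++)
        {
            if (data[i] != header[i])
            {
                return false;
            }
        }

        return true;
    }
}
```

If the byte count matches the file size in storage and the header check passes, the bytes can be passed to `PdfDocument.Open(byte[])`, which takes the stream's position and lifetime out of the picture.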
    MAYUR JANSARI
    @mayurjansari
    I tried some code to convert raw bytes to JPG, but it did not work. Can anyone suggest a library for it?
    Travis Citrine
    @topcat30:matrix.org
    Hi all, I am converting an application from iText to PdfPig and need some help with a couple of functions. One is rotating pages by a variable number of degrees.
    The other is filling AcroForm text fields with data from an XML file: search the PDF for text fields that match the XML tag name and set the field value.
    voogie
    @theVoogie
    Hi guys, I'm using DocstrumBoundingBoxes for finding text boxes and it works well, but as a result I get all my text positions inverted
    i.e. a text box that is supposed to appear at the top gets the position of the text box at the bottom, and vice versa
    has anyone had a similar experience?
    voogie
    @theVoogie
    Bottom-Up approach ... got it :D
    BobLd
    @BobLd
    @theVoogie, did you manage to get the correct coordinates?
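
Assuming the "bottom-up" remark refers to the PDF coordinate convention, where the origin is at the bottom-left of the page and y grows upward, the conversion to top-left-origin coordinates (as used by most image and UI frameworks) can be sketched as follows. The `CoordinateFlip` class and its parameter names are hypothetical and written for illustration; the four edge values stand in for the left, bottom, right and top of a bounding box expressed in PDF space.

```csharp
using System;

// Hypothetical helper for illustration only; not a PdfPig type.
public static class CoordinateFlip
{
    // Converts a box given in bottom-left-origin space into an (x, y, width, height)
    // rectangle in top-left-origin space, where y is the distance from the top of the page.
    public static (double X, double Y, double Width, double Height) ToTopLeftOrigin(
        double left, double bottom, double right, double top, double pageHeight)
    {
        double width = right - left;
        double height = top - bottom;

        // The box's top edge becomes its y offset measured down from the page top.
        double y = pageHeight - top;

        return (left, y, width, height);
    }
}
```

For example, on a page 792 points high, a box with left 72, bottom 700, right 200 and top 720 maps to x 72, y 72, width 128, height 20. Under the same assumption, sorting blocks in reading order from top to bottom means sorting by descending top edge in PDF space.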
    Leonardo Guerra
    @leonard0guerra
    Is it possible to use PDFPig to fill the PDF (Fillable PDF)?
    Leonardo Guerra
    @leonard0guerra
    Fill Form with pdfpig?