    Pieter Marsman
    @pietermarsman
    Turns out one of the two occurrences (the exit of open_filename) could be avoided. It always returns False and that's the same as returning None. I don't want to add typing-extensions as a dependency just for the FloatOrDisabled functionality, so I'm leaving it out for now.
    @htInEdin thanks for signalling this issue!
    Pieter Marsman
    @pietermarsman
    To finish it up, I just released 20211012 with many many improvements:

    Added

    Fixed

    Removed

    htInEdin
    @htInEdin
    WooHoo!
    Philippe Ombredanne
    @pombredanne
    @pietermarsman awesome :heart:
    Jeremy Singer-Vine
    @jsvine
    @pietermarsman Looks great! Hurrah, and thanks to you and all the contributors!
    axelskyttner
    @axelskyttner

    Hi everyone,

    Thank you for this project, I really like it. I recently submitted an issue about saving bezier control point information (pdfminer/pdfminer.six#672). What are your thoughts on adding a .path attribute to LTCurve as suggested by @jsvine?

    Al Smith
    @instrument-al
    Hi, all. Forgive me if this isn't the best place to do this—I'm relatively new to programming in general. I'm trying to (among other things) write something to count the number of lines of text on a particular page of a pdf (in order to establish whether it meets particular guidelines). Does pdfminer.six have the functionality to do this? Thanks very much for your help!
    Pieter Marsman
    @pietermarsman
    @axelskyttner I'll be spending more time on pdfminer.six in the upcoming months, but it could take a while before I can have a detailed look at your issue. At first glance it looks like something we could add, but at this moment I can't say for sure.
    @instrument-al That is definitely possible. This SO answer will point you in the right direction: https://stackoverflow.com/a/69151177/4054971
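A minimal sketch of the line-counting idea, assuming that a "line" means an LTTextLine produced by pdfminer.six's layout analysis. The helper names `page_line_texts` and `count_nonblank` and the file name are hypothetical, and text nested inside figures is not counted here.

```python
from typing import Iterable, Iterator


def count_nonblank(texts: Iterable[str]) -> int:
    """Count the strings that contain something other than whitespace."""
    return sum(1 for text in texts if text.strip())


def page_line_texts(pdf_path: str, page_index: int = 0) -> Iterator[str]:
    """Yield the text of every layout line on one page (zero-indexed)."""
    # Imported lazily so count_nonblank stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path, page_numbers=[page_index]):
        for box in page:
            # Top-level text boxes are containers of text lines.
            if isinstance(box, LTTextContainer):
                for line in box:
                    if isinstance(line, LTTextLine):
                        yield line.get_text()


# Example with a hypothetical file name:
# print(count_nonblank(page_line_texts("document.pdf", page_index=0)))
```

The count depends on the layout parameters, so results for multi-column or heavily formatted pages are worth checking by eye.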

    @everyone, what are your thoughts on pdfminer/pdfminer.six#651?

    The proposal is to add an extra output format for hOCR. I don't have any experience with hOCR, but at first glance it looks useful. I'm not sure if it should also be part of our package. People can add it easily themselves if they need it. Could be a nice illustration of the composable api in the docs.

    In the end this decision is all about how standard the hOCR standard is, I guess. And how often it will be used.

    Philippe Ombredanne
    @pombredanne
    :P
    no idea what hOCR is :P
    EmanueleGusso
    @EmanueleGusso
    Hi everybody! Thanks for your great work :)
    Do you have any thoughts about the last comment I made here? pdfminer/pdfminer.six#281
    htInEdin
    @htInEdin
    @Pieter, I'm slightly confused by, and maybe even disagree with, the way you've deprecated PDFTextExtractionNotAllowedError. As written, the deprecation warning will be issued iff a pdfminer.six developer edits pdfminer sources to use this class. But what we want, I think, is for user code that imports this class and uses it to catch errors to get the deprecation warning. And that won't happen as things stand. If there is an encryption problem, pdfminer will raise PDFTextExtractionNotAllowed, and the user code will not catch it and will die, without ever showing the deprecation warning. What's needed, I guess, is the hack described here: https://stackoverflow.com/questions/40545005/is-it-possible-to-deprecate-an-exception-class, which allows you to raise the deprecation warning when someone imports and catches PDFTextExtractionNotAllowedError...
    Pieter Marsman
    @pietermarsman
    For reference, it is this MR that made the change: pdfminer/pdfminer.six#461
    @htInEdin I agree! It should raise an error when a pdfminer.six user tries to use the class to catch an error.
    deltamacht
    @deltamacht

    Hi all. Thanks for the great work on this package. I'm new but finding it really useful already. I have a simple problem in trying to detect the PDF elements (including vertical text elements). I can read vertical text with no problem using a code snippet like this:

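    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open('../example_files/nu_colour.pdf', 'rb') as infi:
        parser = PDFParser(infi)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams(detect_vertical=True, all_texts=True))
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    print(output_string.getvalue())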

    However, whenever I try to use PDFPageAggregator instead of TextConverter so that I can get the objects, like so:

    with open('../example_files/nu_colour.pdf', 'rb') as infi:
        parser = PDFParser(infi)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams(detect_vertical=True, all_texts=True))
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            for element in layout:
                print(element)

    I'll capture horizontal text boxes (as well as lines, rects, etc.) but I won't capture the vertical text. Is there a way for me to capture the vertical text at the object hierarchy level so that I can inspect its position?

    deltamacht
    @deltamacht
    Hmm, hacking into the code I can tell that the TextConverter object knows about the LTTextLineVertical objects that I'm after, so perhaps I could hack that class to put those somewhere for me to process. But still trying to figure out why the second approach doesn't work to get these objects. Similarly, using the high-level function extract_pages doesn't capture these vertical text lines even when detect_vertical is set to True. I'll keep hacking at it and see if I can't find where the disconnect is happening.
    deltamacht
    @deltamacht
    Finally figured it out. I didn't realize LTFigures could be parents of text objects.
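A minimal sketch of what that finding implies: iterating only the top level of a page misses text nested inside figures, so the layout tree has to be walked recursively. The helper names `walk` and `find_vertical_lines` are hypothetical; the walker assumes that layout containers are iterable and leaf objects are not.

```python
from typing import Any, Iterator


def walk(element: Any) -> Iterator[Any]:
    """Yield an element and all of its descendants, depth first."""
    yield element
    if isinstance(element, (str, bytes)):
        return  # strings are iterable but are never layout containers
    try:
        children = iter(element)
    except TypeError:
        return  # leaf object, nothing below it
    for child in children:
        yield from walk(child)


def find_vertical_lines(pdf_path: str) -> Iterator[Any]:
    """Yield every vertical text line, including those inside figures."""
    # Imported lazily so walk() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextLineVertical

    laparams = LAParams(detect_vertical=True, all_texts=True)
    for page in extract_pages(pdf_path, laparams=laparams):
        for element in walk(page):
            if isinstance(element, LTTextLineVertical):
                yield element
```

Each yielded line carries a `bbox` attribute, which is what makes it possible to inspect its position.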
    Pieter Marsman
    @pietermarsman
    :thumbsup:

    I just created this pdfminer/pdfminer.six#727 to publish the package automatically to PyPI if a v* git tag is added to a commit.

    In my opinion this makes the development branch obsolete since you can always use the git tags to find the latest release. My proposal would be to remove the development branch and merge all PRs directly into master. Any other opinions?

    Ulan Aitbai
    @Ulxn_gitlab
    Sheeeeeeeeesh, what a chat
    W.P. McNeill
    @wpm
    Search for the file PMC1064074.pdf online. (It's a medical research paper with the title "The Na+–H+ exchanger-1 induces cytoskeletal changes involving reciprocal RhoA and Rac1 signaling, resulting in motility and invasion in MDA-MB-435 cells".) On page 3 there is a photo caption that begins with the text line "Efficiency of transfection with Rho family constructs. MDA-MB-435 cell". If you run pdf2txt.py on this document, it erroneously prints out the first part of this line twice. (Run pdf2txt.py bcr922.pdf | grep -A 3 -B 3 "Efficiency of transfection with Rho".) The first two LTTextLineHorizontal children of the relevant LTTextBoxHorizontal are "Efficiency of transfection with Rho family constructs\n" and "Efficiency of transfection with Rho family constructs. MDA-MB-435 cell \n", i.e. the same text is erroneously being assigned to two different lines. Should I create a pdfminer.six issue for this, or is it known that malformed PDFs will sometimes have these issues and there's nothing PDFMiner can do about it? (I'm using pdfminer.six version 20220319.)
    Pieter Marsman
    @pietermarsman
    Yep, creating an issue is the way to go!
    Nebukadneza
    @Nebukadneza:ghostdub.de
    @jsvine hi there — i saw your reply in the issue. would you rather like the file to test yourself, or shall i try to test your "blindly written" fix locally?
    Nebukadneza
    @Nebukadneza:ghostdub.de
    nevermind … it works 🎉 thanks a lot!
    Florian Apolloner
    @apollo13
    Hi lovely folks. I am using pdfminer as part of paperless and it fails to extract text (https://dpaste.org/XXRWU/raw). This is (uhm) expected since the document is encrypted :) That said it doesn't have a password but the pdf standard encryption password and readers like okular, evince, firefox etc display it just fine. So I am wondering if you are open to add support for pdf documents with enabled security? If yes I'll see about opening a ticket and adding code.
    Florian Apolloner
    @apollo13
    I see that pdfminer has support for passwords, will try if encryption via the standard pdf password is doable. If yes it might be worth to try this by default for encrypted blocks?
    Florian Apolloner
    @apollo13
    Ok, it also has the standard password, so something somewhere on the way goes wrong
    Florian Apolloner
    @apollo13
    Davi Sales Barreira
    @davibarreira
    Hello, everyone. I just found out about this amazing package! I'm trying to parse a PDF in order to identify the content structure, I mean, things like "Section 1 [ Section 1.1 [ Text content here ], Section 1.2 [ Text here again ] ]".
    Is this feasible with pdfminer.six?
    I've read the docs, but there wasn't an example for such a task.
    Florian Apolloner
    @apollo13
    (assuming the PDF has a TOC)
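For a PDF that does carry a table of contents, a sketch like the following prints the section hierarchy. It relies on `PDFDocument.get_outlines()`, which yields `(level, title, dest, a, se)` tuples and raises `PDFNoOutlines` when the document has none. The helper names `read_outline` and `format_outline` are hypothetical.

```python
from typing import Iterable, List, Tuple


def format_outline(entries: Iterable[Tuple[int, str]]) -> List[str]:
    """Indent each title by two spaces per nesting level (levels start at 1)."""
    return [f"{'  ' * (level - 1)}{title}" for level, title in entries]


def read_outline(pdf_path: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs, or an empty list if there is no outline."""
    # Imported lazily so format_outline stays usable without pdfminer.six.
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        try:
            # Consume the generator while the file is still open.
            return [(level, title) for level, title, _dest, _a, _se in doc.get_outlines()]
        except PDFNoOutlines:
            return []


# Example with a hypothetical file name:
# print("\n".join(format_outline(read_outline("document.pdf"))))
```

This only recovers the titles and their nesting; it does not say which body text belongs to which title.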
    Tamara
    @TamaraAtanasoska
    Hi @apollo13, building on this script, what would be the easiest way to find the paragraphs/rest of the content "belonging" to each title? If one just shows the hierarchy recursively, going through layout element objects with increasing depth, there is no access to the information that identifies what is a title, like here (it's found in the PDFDocument object). Is there any simple way to extend this script so that, besides the title, it also identifies which other content belongs to this title?
    Or maybe how would we check whether the parent of a part of the content has a title, or is a title itself? Is there any convenient way to connect this information to the layout analysis boxes? I see parent references when I am inspecting what is in the outlines as returned by .get_outlines().
    Tamara
    @TamaraAtanasoska
    @apollo13 sorry if this is a question that's too detailed, I just saw that you are also a user and not a maintainer :D maybe this is something you also solved already?
    Florian Apolloner
    @apollo13
    @TamaraAtanasoska I am not sure if a pdf has that structure at all. But I really have no idea about the PDF spec anyways
    Mind you, a PDF has a structure that is visible to the human reading it, but technically it could just be a few blocks of text with information about where to paint them and no hierarchy
    Tamara
    @TamaraAtanasoska
    Yeah, that's absolutely clear and not what I am asking. I am asking how to relate the LT* objects, which are the result of layout analysis in pdfminer.six, to PDFDocument objects (from which one can extract title information). There are people solving this issue by adding annotations to some of the objects, like here https://github.com/ChrizH/pdfstructure, but I think for my use case I can do something simpler. Thanks for your answer anyway. I thought maybe the maintainers had an idea about this already which would be reflected in the library design.
    Pieter Marsman
    @pietermarsman
    Hi @TamaraAtanasoska, I'm a maintainer of pdfminer.six, but I don't know the perfect answer to your question. As already mentioned above, a PDF is nothing more than a very elaborate way of describing where to draw text and other objects. Our layout algorithm is a best effort at grouping characters, words, lines and paragraphs into meaningful blobs. Perhaps you could investigate how the layout algorithm works and extend or change it for your use case.
    Tamara
    @TamaraAtanasoska
    @pietermarsman thanks for your answer, sorry I was offline for a while. I am indeed doing something that builds on top of the layout algorithm for this, but I am aware it can't be perfect. I have a very general use case, which is definitely not ideal. My plan is to check if the get_outlines function returns anything, and if it does, then find the corresponding blobs by looking at what the layout algorithm spits out that would nest under those titles. If the get_outlines function returns nothing, which from what I see is the case for most of the PDFs I have, I will write a longer list of heuristics to compare text and try to recreate some structure. This all won't be perfect, but it is a direction to take.
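A rough sketch of the first half of that plan, assuming the text lines are already available in reading order and that a title appears on a line of its own. The function names and the exact-match rule are hypothetical simplifications, not anything pdfminer.six provides.

```python
from typing import Dict, Iterable, List, Optional


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so small layout differences match."""
    return " ".join(text.split()).casefold()


def group_under_titles(
    lines: Iterable[str], titles: Iterable[str]
) -> Dict[Optional[str], List[str]]:
    """Assign each line to the most recently seen title.

    Lines that appear before the first title are stored under the key None.
    """
    wanted = {normalize(title): title for title in titles}
    groups: Dict[Optional[str], List[str]] = {None: []}
    current: Optional[str] = None
    for line in lines:
        key = normalize(line)
        if key in wanted:
            # This line is a title: start collecting under it.
            current = wanted[key]
            groups.setdefault(current, [])
        elif key:
            # Non-empty body line: attach it to the current title.
            groups[current].append(line.strip())
    return groups
```

Titles that wrap across several layout lines, or that repeat in running headers, would need extra heuristics on top of this.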
    Pieter Marsman
    @pietermarsman
    @TamaraAtanasoska Good luck!
    ZackCodes.ai
    @KunalGehlot
    image.png
    Was it made like this deliberately? (lines 712 and 715 of pdfdocument.py)
    ZackCodes.ai
    @KunalGehlot
    The documentation has two sections: How-To and Tutorials, which are the same thing. We should merge them into a single section.
    I’m also planning to add a How-to on extracting text with coordinates using the extract_pages API since it is one of the most popular uses of pdfminer I’ve seen out there, and people are still referring to the old examples in this SO question
    I will also add a FAQ on why some characters are not correctly read.
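A minimal sketch of what such a how-to could show, assuming the goal is one bounding box per text box. The helper names `text_boxes` and `format_row` and the file name are hypothetical. Coordinates are `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page.

```python
from typing import Iterator, Tuple

BBox = Tuple[float, float, float, float]


def format_row(page_id: int, bbox: BBox, text: str) -> str:
    """Render one text box as a single printable line."""
    x0, y0, x1, y1 = bbox
    return f"page {page_id} ({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}) {text!r}"


def text_boxes(pdf_path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page id, bounding box, text) for each top-level text box."""
    # Imported lazily so format_row stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page in extract_pages(pdf_path):
        for element in page:
            if isinstance(element, LTTextContainer):
                yield page.pageid, element.bbox, element.get_text().strip()


# Example with a hypothetical file name:
# for page_id, bbox, text in text_boxes("document.pdf"):
#     print(format_row(page_id, bbox, text))
```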