    viola voß
    hello rssscrpr-team!
    I'd like to use the service to monitor several web pages that don't provide an RSS feed of their own, for example https://www.bibliothek.uni-augsburg.de/sw/rvkbibliographie.html.
    I put this URL in the slot under the "Beispiele: inetbib ..." line and press "Go".
    This creates an error message saying "Couldn't determine what the items are...".
    So I suppose I'm doing something wrong?
    Konstantin Baierer
    Yes, as in @zuphilip's example, you need to use xpaths to determine what the items are (as opposed to the more specific MHonarcScraper for inetbib).
    I'll try to have a look soon, but some websites are just not sensibly structured enough to scrape them with a simple "find the list of items -> find the items" algorithm.
    And yes, a manual and some polishing on the UI certainly would help ... So many ideas, so little time.
    Philipp Zumstein
    I started to collect some examples here in the wiki: https://github.com/kba/rssscrpr/wiki/Examples . @v_i_o_l_a_twitter Feel free to add more examples and whatever you think should be stated in the wiki/manual.
    Konstantin Baierer
    I'm working on getting this deployed more easily as a first step so we have a place to link to.
    viola voß
    @zuphilip thank you for the examples! I think I'm starting to understand the service a bit now. :)
    viola voß
    @kba This search for "usable items" reveals how unstructured some (most?) sites are ... :]
    viola voß

    Some of the pages I'd like to monitor deliver their content in tables. But as the columns and cells carry no names of their own, it's not possible to determine any useful items, or is it?

    Zeitschrift "Dialog mit Bibliotheken", DNB

    Reihe "Berliner Handreichungen", edoc HU Berlin
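
On the table question: unnamed cells can still be addressed by position, so each row could serve as an item. The following is only a minimal sketch under stated assumptions. The sample markup, the column order, and the helper name `table_items` are hypothetical, the real pages may be structured differently, and it uses Python's standard library rather than rssscrpr itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample table; not taken from any of the real pages.
SAMPLE = """
<table>
  <tr><th>Issue</th><th>Title</th></tr>
  <tr><td>2016/1</td><td><a href="/issue-1.pdf">First issue</a></td></tr>
  <tr><td>2016/2</td><td><a href="/issue-2.pdf">Second issue</a></td></tr>
</table>
"""

def table_items(markup):
    """Treat every data row as one item and address its cells by position."""
    root = ET.fromstring(markup)
    items = []
    # ".//tr[td]" keeps only rows that contain <td> cells, which skips the header row.
    for row in root.findall(".//tr[td]"):
        link = row.find("td[2]/a")  # second cell, assumed to hold the link
        items.append({
            "label": row.find("td[1]").text,
            "title": "".join(row.find("td[2]").itertext()).strip(),
            "link": link.get("href") if link is not None else None,
        })
    return items

for item in table_items(SAMPLE):
    print(item)
```

The positional expressions (`td[1]`, `td[2]`) are plain XPath, so the same idea should carry over to any tool that accepts XPath, as long as the column order on the page stays stable.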

    Other sites put their content in lists or in plain text. Same problem: No identifiable items, as e.g. "<li>" is also used for the navigation.

    Bibliographie zur RVK, UB Augsburg

    MALIS-Publikationen, TH Köln

    Bibliotheksforum Bayern

    biblos: Online-Ausgaben, ÖNB


    FID Romanistik

    All of these sites are published by libraries. One would think they'd know how to structure data these days ... :)

    The only page I have some "hope" for is this:

    Verband der Bibliotheken des Landes NRW e.V.

    Perhaps here it's possible to use the <div class="news-latest-item-0x"> tag to identify the items?
    The title could be derived from the content of the <h1> tag, and the content of the following <p> tag could be the description.
    Could this be translated to "XPath speech"? :)
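
One possible translation into XPath 1.0, as a sketch: items `//div[starts-with(@class, 'news-latest-item')]`, title `.//h1`, description `.//p`. Whether rssscrpr accepts exactly these expressions is an assumption. The snippet below tries the same selection on hypothetical sample markup with Python's standard library, which has no `starts-with()`, so the class prefix is checked in Python instead; `news_items` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the real page markup may differ.
SAMPLE = """
<div id="content">
  <ul><li>Navigation</li></ul>
  <div class="news-latest-item-01"><h1>First headline</h1><p>First teaser.</p></div>
  <div class="news-latest-item-02"><h1>Second headline</h1><p>Second teaser.</p></div>
</div>
"""

def news_items(markup, prefix="news-latest-item"):
    """Collect title and description from every <div> whose class starts with prefix."""
    root = ET.fromstring(markup)
    items = []
    for div in root.iter("div"):
        # Stand-in for the XPath predicate starts-with(@class, 'news-latest-item').
        if not div.get("class", "").startswith(prefix):
            continue
        items.append({
            "title": div.findtext(".//h1", default="").strip(),
            "description": div.findtext(".//p", default="").strip(),
        })
    return items

for item in news_items(SAMPLE):
    print(item["title"], "|", item["description"])
```

Matching on the class prefix rather than on one exact class value is what lets the numbered variants (`-01`, `-02`, ...) all count as items, while the navigation `<li>` elements are ignored.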

    viola voß
    Thinking about identifiable and unidentifiable items leads me to the question of whether it would also be possible to build a scraper that is "less intelligent" and only looks for changes of any kind made to a monitored website. It would deliver only a message like "there has been a change to this website, so go and have a look" or, a bit more intelligent, "there has been a change to this website, see the comparison of the 'before' and the 'after' here". That's roughly what page2rss did.
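
The "less intelligent" approach described above could look roughly like this. It is a minimal sketch, not how page2rss or rssscrpr work: it assumes the caller has already fetched and stored two text snapshots of the page, and the function names are illustrative.

```python
import difflib
import hashlib

def fingerprint(text):
    """Hash a snapshot so that two versions can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def describe_change(before, after):
    """Return None if nothing changed, otherwise a unified diff of the snapshots."""
    if fingerprint(before) == fingerprint(after):
        return None
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return "\n".join(diff)

old = "Publications\nEntry A\nEntry B"
new = "Publications\nEntry A\nEntry B\nEntry C"
print(describe_change(old, new) or "no change")
```

The hash comparison covers the plain "something changed, go and have a look" case; the diff covers the "before and after" variant. A real monitor would probably strip navigation, timestamps, and other noise before comparing, otherwise every visit might register as a change.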
    viola voß
    @zuphilip Ah, that looks good! I added a title and a description for the feed and put it in the list of examples.
    I am looking forward to new entries in the list of examples ;-)
    viola voß
    @zuphilip I added the MALIS <li>s as an example. :)
    viola voß
    thanks to @zuphilip for starting the XPath primer: https://github.com/kba/rssscrpr/wiki/XPath-Primer #helpful :)