Where communities thrive


  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
People
Repo info
Activity
    utgheith
    @utgheith
    I ran a few more experiments. The following configuration fails (as above) about 2% of the time:
    file system: ext4
    uname -a : Linux linux3 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
    java -version: openjdk 11.0.10 2021-01-19
    Same configuration with ZFS finished 5000 iterations without failure.
    MacOS (11.2.3), openjdk version "15.0.2" 2021-01-19 also run ~5000 iterations without failure
    I will track this down and find the root cause but I wanted to confirm the expected behavior with you first.
    utgheith
    @utgheith
    It would be great if you took a look at the test case and confirmed that is has the correct expectations and I'll take it from there.
    Li Haoyi
    @lihaoyi
    your expectations are correct
    it should receive events in thhe order of deletion, which is children first before parents
    utgheith
    @utgheith

    Question 1: consistency between MacOS and Linux. Removing the watched directory (an obvious corner-case) behaves differently on MacOS and Linux. What is the expectation? Maybe it should be documented as undefined behavior?

    [Sample program] (https://github.com/utgheith/watch/blob/master/src/main/scala/main.scala)

    MacOS

    event
      data

    Linux

    event
      data/a
      data/b
    Li Haoyi
    @lihaoyi
    yeah possibly, we don't use that functionality at work so not surprising that it's inconsistent
    utgheith
    @utgheith

    Here is a stress test that leads to failure in 3 native environments: MacOS, Ubuntu/ext4, Ubuntu/ZFS. There seem to be 2 separate issues:

    (1) a race condition that causes the watch thread to die with FileNotFoundException.
    (2) events get reordered leading to premature watch cancelation on some subdirectories

    utgheith
    @utgheith
    I see the race condition leading to the first issue and the solution appear to be simple so far.
    The second problem requires a bit more thought. Ideas: (1) delay event processing and order delete events (reduces the probability of failure), (2) generate synthetic delete triggers for missing directories (will generate more events than expected and doesn't handle missing file deletes), (3) Don't use WatchService (not obvious how it would help), ... I'll keep digging.
    utgheith
    @utgheith
    Would it be acceptable for os.watch.watch to produce more events than strictly promised (with low probability)?
    Li Haoyi
    @lihaoyi
    for my use case, sure
    it doesn't really have a rigorous spec lol
    utgheith
    @utgheith
    If that's acceptable then there is a simple fix: the implementation os.watch.watch keeps track of all registered paths and refreshes the list on each event. We can generate a synthetic DELETE event for any missing entries.
    Li Haoyi
    @lihaoyi
    i think it sounds reasonable
    maybe put up a PR describing the problem and solution in more detail?
    utgheith
    @utgheith
    Implications:
    (1) only works for directories but I've empirically noticed that WatchService only drops directory DELETE events (at least in the stress tests I tried)
    (2) might surprise applications that are not prepared for the extra DELETE events
    (3) would violate the stated behavior for moved directories (for those who might interpret it as anything but a lower bound)
    (4) would require changing the tests to use set inclusion instead of equality (change in one function)
    Will do. Either way I think it is critical that we get this issue out of the way because it is blocking unrelated pull requests.
    Li Haoyi
    @lihaoyi
    I think duplicate events is probably fine
    utgheith
    @utgheith
    But it would be more than duplicate events in this case. It might generate "unexpected" events. For example: I interpret the following paragraph as a lower bound but someone that interprets it as strict equality might be confused by seeing DELETE events for sub directories of moved directories:
    Once the call to watch returns, onEvent is guaranteed to receive a an event containing the path for:
    
    - Every file or folder that gets created, deleted, updated or moved within the watched folders
    
    - For copied or moved folders, the path of the new folder as well as every file or folder within it.
    
    - For deleted or moved folders, the root folder which was deleted/moved, but without the paths of every file that was within it at the original location
    Li Haoyi
    @lihaoyi
    entirely unexpected events sounds a bit tough
    utgheith
    @utgheith
    But we're already in violation of the 3rd assertion (using the strict interpretation). DELETE events are being generated for deleted sub-folders but not for moved ones.
    Li Haoyi
    @lihaoyi
    yeah
    utgheith
    @utgheith

    I apologize for all the spam but here is a simple test that shows that the problem is in WatchService. os.watch is likely masking some of those issues in its implementation but not all.

    https://github.com/utgheith/watch3/blob/master/src/main/scala/main.scala

    utgheith
    @utgheith

    It seems that readme.md is using travis-ci badges instead of the github actions ones. Clicking on the build status goes back to a January build (telling me that the switch occurred around this time). Tried to submit a pull request that fixes that but it gets stuck and never runs the tests. This seems to be related to an issue that I don't fully understand but requires special privileges to deal with anyway. Here is a link:

    https://github.community/t/expected-waiting-for-status-to-be-reported/16727

    Lorenzo Gabriele
    @lolgab
    @utgheith The build now has a name: ci that's why the statuses don't match. Can you please rebase the PR on top of current master?
    utgheith
    @utgheith
    done
    Li Haoyi
    @lihaoyi
    @utgheith I would not be surprised by bugs in the underlying WatchService, would be nice if we could paper over it though. Not sure hwo hard that would be
    utgheith
    @utgheith
    How about using intotify directly? Similar to what happens on MacOS. But first I need to convince myself that the issues in WatchService are not cause by inotify :-)
    utgheith
    @utgheith
    com-lihaoyi/os-lib#69 is an attempt to paper over it but it makes me uncomfortable.
    Li Haoyi
    @lihaoyi
    using inotify directly is possible, I never thought of it because I thought WatchService generally worked for the stuff we used it for (as opposed on on OS-X, where WatchService does not work at all
    on a large folder due to polling)
    utgheith
    @utgheith
    I will write some stress tests and share the results but I have a few concerns regarding os.watch and I want to know your thoughts on them.
    (1) os.watch is the sort of thing that requires stress testing. Unit testing is still important obviously. Do you have your own stress tests? If not, is there value in developing some and running them as part of CI?
    (2) The API doesn't make any statements regarding OVERFLOW (silently drop events, force the caller to re-examine the world, ...). inotify uses a finite system-wide queue for notifications and there is no guarantee that it will not overflow.
    Li Haoyi
    @lihaoyi
    idont have any stress tests directly on the library; only for downstream projects.
    I actualyl dont know what to do about OVERFLOW events
    utgheith
    @utgheith
    In the one project where I used os.watch, I resorted to periodically restarting the application in an orderly manner. I could imagine writing a loop that reinitializes the world when it receives an OVERFLOW event.
    utgheith
    @utgheith

    Same test as before but uses inotify (not directly but by spawning a process running inotifywatch). Works reliably on the same test that was failing before. This strongly hints that using inotify directly will be more reliable. Also inotify is explicit about how it handles moves (move_to, move_from, move_self, etc)

    https://github.com/utgheith/watch4/blob/master/src/main/scala/main.scala

    Li Haoyi
    @lihaoyi
    @utgheith I think com-lihaoyi/os-lib#69 sounds reasonable; am I right to say that this change is linux-only and won't affect the fsevents-based OSX version?
    if someone doesn't care about the inner events, they can just ignore them
    I'm happy to have a raw inotify-based implementation too, if you don't mind wrangling some JNA FFI code. That's what we ended up doing for the OS-X backend anyway so it's not unprecedented
    Li Haoyi
    @lihaoyi
    w.r.t. the extra events, the only worry I can think of is if moving a large folder results in a huge flood of useless events
    hard to guess whether that would be a problem in practice or not; for our use case at work, it'll probably just take a bit longer to realize it doesn't need to do anything, but no big deal
    utgheith
    @utgheith
    Started an intofiy based Watcher implementation (com-lihaoyi/os-lib#70). It's still rough around the edges and didn't have a change to stress-test it but it already passes the existing os.watch tests reliably (even on Docker). Would be great if you took look at the overall design.
    @lihaoyi yes, the changes will be Linux only either way because OSX uses its own Watcher
    Li Haoyi
    @lihaoyi
    OSX has similar bugs :p also only under heavy load, so havent had a chance to investigate too deeply
    utgheith
    @utgheith
    @lihaoyi your assertion is correct. I added a simple randomized test that creates trees with up to 100 things in it then delete them. The bad news: it fails on both OSX and Linux. The good news: it passes for the inotify based Watcher. I think this will be sorted out.
    Do you have any objections to adding scalacheck as a dependency? It makes working with such tests easier.
    Li Haoyi
    @lihaoyi
    go for it