Henrik Sandklef
@hesa
Ah nice
Maximilian Huber
@maxhbr
The x11 / ICU clash comes from the scancode data and was already discussed in https://github.com/maxhbr/LDBcollector/issues/4#
Henrik Sandklef
@hesa
Yes. I am curious if the list of aliases can be used in license-expression (by first adding it to https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses).

Looking at the scancode files:
$ head -7 x11.yml
key: x11
short_name: X11 License
name: X11 License
category: Permissive
owner: XFree86 Project, Inc
homepage_url: http://www.xfree86.org/3.3.6/COPYRIGHT2.html
spdx_license_key: ICU

Is this the origin of "your" circular alias?

Maximilian Huber
@maxhbr
yes
In the SPDX list, X11 and ICU are two independent licenses, and this joins the two. This violates the rule that the main IDs never clash with aliases ...
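A minimal sketch of the clash described above, assuming each SPDX key is treated as a main ID and each scancode key as an alias of it. The records are hand-written for illustration (only the x11 entry mirrors the x11.yml excerpt), and `find_alias_clashes` plus the case-insensitive comparison are hypothetical choices, not part of scancode or LDBcollector.

```python
# Hypothetical, hand-written records: "key" is the scancode key (used here as
# an alias) and "spdx_license_key" is the SPDX main ID it points to.
licenses = [
    {"key": "x11", "spdx_license_key": "ICU"},
    {"key": "x11-xconsortium", "spdx_license_key": "X11"},
    {"key": "mit", "spdx_license_key": "MIT"},
]


def find_alias_clashes(records):
    """Return {alias: (main ID it points to, other main ID it collides with)}.

    An alias clashes when, compared case-insensitively (an assumption), it is
    spelled like the main ID of a different license.
    """
    main_ids = {r["spdx_license_key"].lower(): r["spdx_license_key"] for r in records}
    clashes = {}
    for record in records:
        alias = record["key"]
        target = record["spdx_license_key"]
        other = main_ids.get(alias.lower())
        if other is not None and other != target:
            clashes[alias] = (target, other)
    return clashes


print(find_alias_clashes(licenses))  # {'x11': ('ICU', 'X11')}
```

Here the alias x11 points to ICU but is also spelled like the independent main ID X11, which is the circular alias being discussed.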
Henrik Sandklef
@hesa
Hmmm... OK :(
Philippe Ombredanne
@pombredanne

If "alias" is already a concept, please show me an example :)

An alias is something in license-expression, but not in scancode. IMHO this would be a list of strings named aliases.

Philippe Ombredanne
@pombredanne

@maxhbr re:

The LDBcollector already has these aliases: https://github.com/maxhbr/LDBcollector/blob/generated/aliases/aliases.csv#L686

This would be perfect!

Philippe Ombredanne
@pombredanne

@maxhbr @hesa re:

The x11 / ICU clash comes from the scancode data and was already discussed in https://github.com/maxhbr/LDBcollector/issues/4#

I would rather say that SPDX got it differently, and possibly not right.
But basically:

  1. what scancode calls x11 and SPDX calls ICU originated at X11 not ICU. It is mapped to SPDX ICU alright in SC data.
  2. what scancode calls x11-xconsortium and SPDX calls X11 is really specific to the X consortium. It is mapped to SPDX X11 alright in SC data.

But I reckon this may need some thinking wrt. use as aliases. The simplest may be to ignore the scancode key in the set of license symbols and only use SPDX IDs and aliases.
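One way to read that suggestion, as a rough sketch only: build the lookup table from SPDX IDs and an alias list, and leave the scancode keys out. The names `build_symbol_table` and `resolve`, the sample alias, and the case-insensitive lookup are all assumptions made for illustration; this is not the license-expression API.

```python
# Hypothetical inputs: a few SPDX main IDs and one illustrative alias row.
spdx_ids = ["ICU", "X11", "MIT"]
aliases = {"Expat": "MIT"}


def build_symbol_table(spdx_ids, aliases):
    """Map lowercased names to SPDX IDs, refusing aliases that shadow a main ID."""
    table = {spdx_id.lower(): spdx_id for spdx_id in spdx_ids}
    for alias, target in aliases.items():
        existing = table.get(alias.lower())
        if existing is not None and existing != target:
            raise ValueError(f"alias {alias!r} -> {target!r} clashes with {existing!r}")
        table[alias.lower()] = target
    return table


def resolve(name, table):
    """Return the SPDX ID for a name or alias, or None if it is not known."""
    return table.get(name.lower())


table = build_symbol_table(spdx_ids, aliases)
print(resolve("x11", table))    # X11
print(resolve("Expat", table))  # MIT
```

Because the scancode key x11 is never added as a symbol, "x11" resolves to the SPDX ID X11 and nothing points it at ICU.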

@hesa if you'd like to join, there is a weekly call starting at https://meet.jit.si/AboutCode now
Ayan Sinha Mahapatra
@AyanSinhaMahapatra
Philippe Ombredanne
@pombredanne
thx
Aditya Sangave
@adii21-Ux
Hi everyone, I am Aditya, an undergrad student from India. I just finished setting up my development environment for scancode because I want to contribute to this project. I am good with Python, Django, HTML and CSS, so if there are any issues I can work on to understand the codebase, please let me know.
Aditya Sangave
@adii21-Ux
Hello, I was going through the scancode-toolkit documentation, and here (https://scancode-toolkit.readthedocs.io/en/latest/getting-started/newcomer.html) I found that three points about scans are repeated in two sections, namely Try Scancode Toolkit and Installing Scancode. I don't think it's necessary to have them in both.
Philippe Ombredanne
@pombredanne
@adii21-Ux good catch :) the doc needs some significant love alright (much more than just a few typos)
Aditya Sangave
@adii21-Ux
should I fix this and open a PR?
Philippe Ombredanne
@pombredanne
@adii21-Ux sure thing, and having something that touches code and not just doc is always welcome too
Aditya Sangave
@adii21-Ux
ok, I'll see if I can make some other changes too
Mike Rombout
@mrombout

I am having a bit of trouble understanding Query.unknowns_by_pos. As far as I can tell, the query tests/licensedcode/data/datadriven/external/fossology-tests/BSD/lz4.license.txt matches the rule src/licensedcode/data/rules/bsd-simplified_and_gpl-2.0_1.RULE exactly (apart from everything after line 7). Yet in the refine_matches phase (second iteration) it reports the following: unknowns_by_pos = defaultdict(<class 'int'>, {43: 0, 41: 0, 15: 0, 21: 0, 20: 0}). I am particularly surprised by 15, 20 and 21, and this is throwing off #2637.

I was under the impression that a token is considered unknown if token not in query.idx.dictionary? But is it not that simple?

Mike Rombout
@mrombout

I'm afraid I have another case that I'm not able to work out. Another one where the ispan is too inclusive: https://github.com/softsense/scancode-toolkit/blob/issue-2637-allow-license-rules-to-require-the-presence-of-certain-defining-keywords/tests/licensedcode/test_match.py#L325

The ispan of the match contains Span(2,22), but I feel it should be Span(2,4)|Span(7...) so that it does not include the key phrase of Span(2,8).

Philippe Ombredanne
@pombredanne
@mrombout hey :wave: ...let me check.
Philippe Ombredanne
@pombredanne
@mrombout https://github.com/softsense/scancode-toolkit/pull/1/files#r758425422 you are being misled by the weirdness that may exist in very small indexes
and maybe by the actual nature of what is in an ispan?
Mike Rombout
@mrombout

Ah, so as far as the ispan is concerned all the words are matching; it is not concerned about the extra words in there, that'd be for the qspan. So what is still missing from the key_phrase_filter is checking whether key_phrase_span is uninterrupted in the qspan.

What if I use zip(qspan, ispan) to create a query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span)) if ipos in key_phrase_span, i.e. a Span offset to where the key phrase matches in the query, and then check whether that query_key_phrase_span is in qspan? Would that reasoning be in the right direction?

Philippe Ombredanne
@pombredanne
@mrombout hum... let me think a bit... this is a dense question! :D
Mike Rombout
@mrombout
What I'm trying right now is:
for qpos, ipos in zip(match.qspan, match.ispan):
    if ipos in key_phrase_span:
        query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span))
        if query_key_phrase_span not in match.qspan:
            has_key_phrases = False
            break
Philippe Ombredanne
@pombredanne
For reference, we have stopwords and unknown words:
  • unknown words DO NOT exist anywhere in any RULE or LICENSE. They can be seen only in Query.unknowns_by_pos, where we only track how many unknown words exist after a known word position. They are present in neither the ispan nor the qspan.
  • stopwords exist in RULEs and LICENSEs; they are short words that are too common to be useful. They are skipped on both the index and the query side. They are present in neither the ispan nor the qspan. They can be seen only in Query.stopwords_by_pos, where we only track how many stopwords exist after a known word position.
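The two counters described above can be sketched with a toy tokenizer pass. This is a simplified illustration, not scancode's Query code: `index_query`, the sample dictionary and the sample stopwords are hypothetical, and counting words seen before the first known word under position -1 is an assumption of this sketch.

```python
from collections import defaultdict


def index_query(tokens, dictionary, stopwords):
    """Return (known_tokens, unknowns_by_pos, stopwords_by_pos).

    Only known tokens get a position. Unknown words and stopwords are counted
    against the position of the last known token seen before them.
    """
    known = []
    unknowns_by_pos = defaultdict(int)
    stopwords_by_pos = defaultdict(int)
    pos = -1  # no known token seen yet
    for token in tokens:
        if token in stopwords:
            stopwords_by_pos[pos] += 1
        elif token in dictionary:
            pos += 1
            known.append(token)
        else:
            unknowns_by_pos[pos] += 1
    return known, dict(unknowns_by_pos), dict(stopwords_by_pos)


# Hypothetical index dictionary and stopword set, for illustration only.
dictionary = {"redistribution", "use", "source", "forms"}
stopwords = {"and", "in"}
tokens = "redistribution and use in source foo bar forms".split()

known, unknowns, stops = index_query(tokens, dictionary, stopwords)
print(known)     # ['redistribution', 'use', 'source', 'forms']
print(unknowns)  # {2: 2}
print(stops)     # {0: 1, 1: 1}
```

Neither "foo bar" nor the stopwords take up a position, so the known words stay at consecutive positions 0 to 3 even though the text between them is not contiguous.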

By construction, key_phrase_span should be:

  • containing no stopwords
  • containing no unknowns

Therefore, it should be possible to do key_phrase_span in match.qspan.
I reckon the code snippet above is for the next step.

Your snippet looks fine to me :+1:
paraphrasing it, I read it this way:
Philippe Ombredanne
@pombredanne
if a matched rule word position is present in a key phrase (i.e. is the first position of a key phrase), then create a query-side span of positions as long as the key phrase, and check if it exists entirely in the matched query
@mrombout is key_phrase_span for a single key phrase or all the key_phrases of a rule?
Mike Rombout
@mrombout

Yes, that's what I was trying to communicate. So that under the {{Creative Commons Attribution 4.0 International License}} (the "License"); won't match under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The key_phrase_span is a single key phrase.
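A rough sketch of the check discussed above for a single key phrase, written with plain Python lists instead of scancode's Span class. `key_phrase_is_contiguous` is a hypothetical helper, and it assumes that qspan and ispan are aligned position lists of equal length and that the extra query words are known tokens that occupy query positions (unknown words would need a separate look at unknowns_by_pos).

```python
def key_phrase_is_contiguous(qspan, ispan, key_phrase_span):
    """Return True if every key phrase position is matched and the matched
    query positions are consecutive, i.e. no extra known word interrupts it.

    qspan and ispan are aligned lists of query-side and rule-side positions.
    key_phrase_span lists the rule-side positions of one key phrase.
    """
    qpos_by_ipos = dict(zip(ispan, qspan))
    if any(ipos not in qpos_by_ipos for ipos in key_phrase_span):
        return False  # part of the key phrase was not matched at all
    qpositions = [qpos_by_ipos[ipos] for ipos in key_phrase_span]
    return all(b - a == 1 for a, b in zip(qpositions, qpositions[1:]))


key_phrase = [2, 3, 4]  # rule positions of the key phrase

# Exact match: the key phrase is uninterrupted on the query side.
print(key_phrase_is_contiguous([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], key_phrase))  # True

# Two extra known words sit inside the key phrase on the query side.
print(key_phrase_is_contiguous([0, 1, 2, 3, 6, 7], [0, 1, 2, 3, 4, 5], key_phrase))  # False

# Rule position 3 of the key phrase is missing from the match.
print(key_phrase_is_contiguous([0, 1, 2, 3, 4], [0, 1, 2, 4, 5], key_phrase))  # False
```

The second case is the shape of the Creative Commons example: all rule words match, but extra words split the key phrase in the query.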

Philippe Ombredanne
@pombredanne
BTW do you need to keep track of a complete span for a key phrase? Or maybe you only need its start position and its length? And maybe Rule.keyphrases could be a mapping of {start_position: length}?
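To make the {start_position: length} idea concrete, here is a small sketch that derives such a mapping from {{...}} markers. `key_phrase_lengths` is a hypothetical helper, and it splits on whitespace only, which is an assumption and not how scancode tokenizes rule text.

```python
def key_phrase_lengths(text):
    """Return (tokens, {start_position: length}) for {{...}}-marked phrases."""
    tokens = []
    phrases = {}
    start = None
    # Pad the markers with spaces so they come out as separate items.
    for item in text.replace("{{", " {{ ").replace("}}", " }} ").split():
        if item == "{{":
            start = len(tokens)
        elif item == "}}":
            if start is not None and len(tokens) > start:
                phrases[start] = len(tokens) - start
            start = None
        else:
            tokens.append(item.lower())
    return tokens, phrases


text = 'under the {{Creative Commons Attribution 4.0 International License}} (the "License");'
tokens, phrases = key_phrase_lengths(text)
print(phrases)  # {2: 6}

# The full set of positions can be rebuilt from the mapping when needed.
for start, length in phrases.items():
    print(tokens[start:start + length])
```

The mapping is compact, and a complete run of positions is still one `range(start, start + length)` away.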

Yes, that's what I was trying to communicate. So that under the {{Creative Commons Attribution 4.0 International License}} (the "License"); won't match under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

yes, remember my comment on your PR... if you have only one rule in your test index, the universe of unknown words is very large :D

Mike Rombout
@mrombout

That was just a bad attempt to make my example easier to run, the same problem persists when running as a datadriven test (they use the full index right?). Hence why I am taking another look.

https://dev.azure.com/nexB/scancode-toolkit/_build/results?buildId=5135&view=logs&jobId=58aae7ae-0fb4-5f3e-3a0c-bb9f2080987e&j=58aae7ae-0fb4-5f3e-3a0c-bb9f2080987e&t=8cf9f4a1-3731-59c7-8acb-d9b49562ab2c

Philippe Ombredanne
@pombredanne
checking the test and matched rule texts...
You should expect a single cc-by-nc-sa-4.0 match IMHO
Philippe Ombredanne
@pombredanne
BTW this other thing from Amazon is totally fubar https://github.com/awsdocs/aws-net-developer-guide
it is both under cc-by-sa and cc-by-nc-sa which are completely different puppies
@mrombout In your failing test above you are likely missing a rule
Your new filtering code ... does filter! and it exposes that we are missing some rules there
(which makes me think of a new refinement.... IMHO you might want to avoid filtering later not only when there is only one match... but when there is only one match in a given region possibly a query run? ... just a thought)
Philippe Ombredanne
@pombredanne
@mrombout why have you closed your PR at softsense/scancode-toolkit#1 ? also do not forget to merge in softsense/scancode-toolkit#2 IMHO