This is a channel focused on ScanCode support and not as noisy as the main discuss channel
Looking at the scancode files:
$ head -7 x11.yml
key: x11
short_name: X11 License
name: X11 License
category: Permissive
owner: XFree86 Project, Inc
homepage_url: http://www.xfree86.org/3.3.6/COPYRIGHT2.html
spdx_license_key: ICU
Is this the origin of "your" circular alias?
@maxhbr re:
The LDBcollector already has these aliases: https://github.com/maxhbr/LDBcollector/blob/generated/aliases/aliases.csv#L686
This would be perfect!
@maxhbr @hesa re:
The x11 / ICU clash comes from the scancode data and was already discussed in https://github.com/maxhbr/LDBcollector/issues/4#
I would rather say that SPDX got it differently, and possibly not right.
But basically:
- x11, which SPDX calls ICU, originated at X11, not ICU. It is mapped to SPDX ICU alright in SC data.
- x11-xconsortium, which SPDX calls X11, is really specific to the X consortium. It is mapped to SPDX X11 alright in SC data.
But I reckon this may need some thinking wrt. their use as aliases. The simplest may be to ignore the scancode key in the set of license symbols and only use SPDX and aliases.
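To make the clash concrete, here is a minimal sketch. The two records only carry the key/SPDX pairs mentioned above; `build_aliases`, `ambiguous` and the case-insensitive comparison are assumptions made for illustration, not how ScanCode or LDBcollector actually build their alias tables.

```python
# Hypothetical, simplified records: only the key/SPDX pairs from this thread.
LICENSES = [
    {"key": "x11", "spdx_license_key": "ICU"},
    {"key": "x11-xconsortium", "spdx_license_key": "X11"},
]


def build_aliases(licenses, include_scancode_key):
    """Map each lowercased alias to the set of SPDX ids it could refer to."""
    aliases = {}
    for lic in licenses:
        spdx = lic["spdx_license_key"]
        names = [spdx]
        if include_scancode_key:
            names.append(lic["key"])
        for name in names:
            # Assumption: aliases are compared case-insensitively.
            aliases.setdefault(name.lower(), set()).add(spdx)
    return aliases


def ambiguous(aliases):
    """Return only the aliases that point at more than one SPDX id."""
    return {name: ids for name, ids in aliases.items() if len(ids) > 1}


with_keys = build_aliases(LICENSES, include_scancode_key=True)
without_keys = build_aliases(LICENSES, include_scancode_key=False)

print(ambiguous(with_keys))     # "x11" points at both ICU and X11
print(ambiguous(without_keys))  # no ambiguous alias left
```

With the scancode key included, "x11" resolves to two different SPDX ids; once the key is dropped, each alias has a single target.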
I am having a bit of trouble understanding Query.unknowns_by_pos. As far as I can tell, the query tests/licensedcode/data/datadriven/external/fossology-tests/BSD/lz4.license.txt matches the rule src/licensedcode/data/rules/bsd-simplified_and_gpl-2.0_1.RULE exactly (apart from everything after line 7). Yet in the refine_matches phase (second iteration) it reports the following: unknowns_by_pos = defaultdict(<class 'int'>, {43: 0, 41: 0, 15: 0, 21: 0, 20: 0}). I am particularly surprised by 15, 20 and 21. And this is throwing off #2637.
I was under the impression that a token is considered unknown if token not in query.idx.dictionary? But is it not that simple?
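One thing that may be worth ruling out, as a guess rather than a confirmed diagnosis: every value in that dict is 0, and a defaultdict(int) gains a zero-valued entry whenever a missing key is merely read with `d[key]`. The toy model below shows the effect; `count_unknowns`, the sample tokens, and the "keyed by the position of the preceding known token" convention are assumptions for illustration, not the actual Query code.

```python
from collections import defaultdict


def count_unknowns(tokens, dictionary):
    """Toy model: count unknown tokens, keyed by the position of the
    known token that precedes them (-1 if none precedes them)."""
    unknowns_by_pos = defaultdict(int)
    known_pos = -1
    for token in tokens:
        if token in dictionary:
            known_pos += 1
        else:
            unknowns_by_pos[known_pos] += 1
    return unknowns_by_pos


# Hypothetical dictionary and query tokens.
dictionary = {"redistribution", "and", "use", "in", "source"}
tokens = ["redistribution", "and", "use", "lz4", "in", "source"]

unknowns = count_unknowns(tokens, dictionary)
print(dict(unknowns))  # a single real entry: one unknown after known position 2

# Merely reading a missing key inserts it with the default value 0.
for pos in (0, 1):
    _ = unknowns[pos]
print(dict(unknowns))  # positions 0 and 1 now appear, both with a count of 0
```

If this is what is happening, the zero-valued keys would not mean those tokens are unknown; reading with `unknowns.get(pos, 0)` avoids creating them.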
I'm afraid I have another case that I'm not able to work out. Another one where the ispan is too inclusive: https://github.com/softsense/scancode-toolkit/blob/issue-2637-allow-license-rules-to-require-the-presence-of-certain-defining-keywords/tests/licensedcode/test_match.py#L325
The ispan of the match contains Span(2,22), but I feel it should be Span(2,4)|Span(7...) so that it does not include the key phrase ispan of Span(2,8)?
Ah, so as far as the ispan is concerned, all the words are matching; it is not concerned about the extra words in there, that'd be for the qspan. So what is still missing from the key_phrase_filter is checking if key_phrase_span is uninterrupted in the qspan.
If I used zip(qspan, ispan) to create a query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span)) if ipos in key_phrase_span, that is, a Span offset by where the key phrase matches in the query, and then checked query_key_phrase_span in qspan, would that reasoning be in the right direction?
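The idea above can be sketched with plain lists of positions standing in for the spans. `key_phrase_is_contiguous` is a hypothetical helper, not part of licensedcode, and it assumes that zipping the sorted query and rule positions pairs each matched query token with its rule token.

```python
def key_phrase_is_contiguous(qpositions, ipositions, key_phrase_ipositions):
    """Return True if every key phrase rule position is matched and the
    query positions it maps to are consecutive (no extra query words
    in between)."""
    # Pair each matched query position with its rule position,
    # as zip(qspan, ispan) would.
    qpos_by_ipos = {ipos: qpos for qpos, ipos in zip(qpositions, ipositions)}

    key_qpositions = []
    for ipos in key_phrase_ipositions:
        if ipos not in qpos_by_ipos:
            return False  # part of the key phrase was not matched at all
        key_qpositions.append(qpos_by_ipos[ipos])

    # Consecutive query positions mean nothing was inserted inside the phrase.
    return all(b == a + 1 for a, b in zip(key_qpositions, key_qpositions[1:]))


key_phrase = [2, 3, 4]  # hypothetical rule positions of the key phrase

# Exact match: query and rule positions line up one to one.
print(key_phrase_is_contiguous([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], key_phrase))

# An extra query word sits at query position 4, inside the key phrase.
print(key_phrase_is_contiguous([0, 1, 2, 3, 5, 6], [0, 1, 2, 3, 4, 5], key_phrase))
```

The sketch only sees positions that are present in the spans, so any query word that never receives a position would need a separate check.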