This is a channel focused on ScanCode support and not as noisy as the main discuss channel
Looking at the scancode files:
$ head -7 x11.yml
key: x11
short_name: X11 License
name: X11 License
category: Permissive
owner: XFree86 Project, Inc
homepage_url: http://www.xfree86.org/3.3.6/COPYRIGHT2.html
spdx_license_key: ICU
Is this the origin of "your" circular alias?
@maxhbr re:
The LDBcollector already has these aliases: https://github.com/maxhbr/LDBcollector/blob/generated/aliases/aliases.csv#L686
This would be perfect!
@maxhbr @hesa re:
The x11 / ICU clash comes from the scancode data and was already discussed in https://github.com/maxhbr/LDBcollector/issues/4#
I would rather say that SPDX got it differently, and possibly not right.
But basically:
x11 (which SPDX calls ICU) originated at X11, not ICU. It is mapped to SPDX ICU alright in the SC data.
x11-xconsortium (which SPDX calls X11) is really specific to the X Consortium. It is mapped to SPDX X11 alright in the SC data.
But I reckon this may need some thinking wrt. use as aliases. The simplest may be to ignore the scancode key in the set of license symbols and only use SPDX and aliases.
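To make the clash concrete, here is a minimal sketch of an alias map built from the two records discussed above. The record layout, the build_alias_map helper and the case-insensitive lookup are assumptions for illustration; they are not LDBcollector or ScanCode code.

```python
# Hypothetical, trimmed-down records mirroring the two ScanCode licenses above.
LICENSES = [
    {"key": "x11", "spdx_license_key": "ICU"},
    {"key": "x11-xconsortium", "spdx_license_key": "X11"},
]


def build_alias_map(licenses, include_scancode_key):
    """Map a lowercased symbol to the set of SPDX ids it could resolve to."""
    alias_map = {}
    for lic in licenses:
        symbols = [lic["spdx_license_key"]]
        if include_scancode_key:
            symbols.append(lic["key"])
        for symbol in symbols:
            # Assumption: symbols are compared case-insensitively.
            alias_map.setdefault(symbol.lower(), set()).add(lic["spdx_license_key"])
    return alias_map


with_keys = build_alias_map(LICENSES, include_scancode_key=True)
spdx_only = build_alias_map(LICENSES, include_scancode_key=False)

# Symbols that resolve to more than one SPDX id are ambiguous.
ambiguous = {s: ids for s, ids in with_keys.items() if len(ids) > 1}
```

With the ScanCode keys included, "x11" resolves to both ICU and X11; with SPDX ids only, each symbol resolves to exactly one license.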
I am having a bit of trouble understanding Query.unknowns_by_pos. As far as I can tell the query tests/licensedcode/data/datadriven/external/fossology-tests/BSD/lz4.license.txt matches the rule src/licensedcode/data/rules/bsd-simplified_and_gpl-2.0_1.RULE exactly (apart from everything after line 7). Yet in the refine_matches phase (second iteration) it reports the following: unknowns_by_pos = defaultdict(<class 'int'>, {43: 0, 41: 0, 15: 0, 21: 0, 20: 0}). I am particularly surprised by 15, 20 and 21. And this is throwing off #2637.
I was under the impression that a token will be considered unknown if token not in query.idx.dictionary. But it is not that simple?
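One thing worth ruling out is plain Python behaviour: a defaultdict(int) gains a key with value 0 whenever a missing key is read with square brackets. Whether that is what produces the zero-valued entries here is an assumption that has not been checked against the ScanCode code; the sketch below only shows the mechanism.

```python
from collections import defaultdict

unknowns_by_pos = defaultdict(int)
unknowns_by_pos[3] += 1  # one unknown word recorded after known position 3

# A plain read of a missing key inserts it with the default value 0.
count_after_15 = unknowns_by_pos[15]

# .get() and membership tests do not insert anything.
count_after_20 = unknowns_by_pos.get(20, 0)
has_21 = 21 in unknowns_by_pos
```

After this runs, the mapping holds an entry for 15 with a count of 0 even though no unknown word was ever recorded there, while 20 and 21 are still absent.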
I'm afraid I have another case that I'm not able to work out. Another one where the ispan is too inclusive: https://github.com/softsense/scancode-toolkit/blob/issue-2637-allow-license-rules-to-require-the-presence-of-certain-defining-keywords/tests/licensedcode/test_match.py#L325
The ispan of the match contains Span(2,22), but I feel it should be Span(2,4)|Span(7...) so that it does not include the key phrase of Span(2,8)
ispan?
Ah, so as far as the ispan is concerned all the words are matching; it is not concerned about the extra words in there, that'd be for the qspan. So what is still missing from the key_phrase_filter is checking if key_phrase_span is uninterrupted in the qspan.
If I used zip(qspan, ispan) to create query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span)) whenever ipos in key_phrase_span, I would get a Span offset to where the key phrase matches in the query. I could then check query_key_phrase_span in qspan. Would that reasoning be in the right direction?
for qpos, ipos in zip(match.qspan, match.ispan):
    if ipos in key_phrase_span:
        query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span))
        if query_key_phrase_span not in match.qspan:
            has_key_phrases = False
            break
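The "uninterrupted in the qspan" idea can be sketched without the ScanCode Span class, using plain sorted lists of positions. The function name and the list-based inputs are hypothetical stand-ins, and the sketch assumes that query and rule positions pair up one to one, as the zip above does.

```python
def key_phrase_is_contiguous(qpositions, ipositions, key_phrase_ipositions):
    """
    Return True if every key phrase rule position is matched and the aligned
    query positions are consecutive (no other query position in between).
    """
    wanted = set(key_phrase_ipositions)
    # Query positions aligned with the key phrase rule positions.
    aligned = [q for q, i in zip(qpositions, ipositions) if i in wanted]
    if len(aligned) != len(wanted):
        return False  # part of the key phrase was not matched at all
    return all(b - a == 1 for a, b in zip(aligned, aligned[1:]))


# Key phrase sits at rule positions 2 and 3.
intact = key_phrase_is_contiguous([10, 11, 12, 13, 14], [0, 1, 2, 3, 4], [2, 3])
interrupted = key_phrase_is_contiguous([10, 11, 12, 14, 15], [0, 1, 2, 3, 4], [2, 3])
missing = key_phrase_is_contiguous([10, 11, 12, 13], [0, 1, 2, 4], [2, 3])
```

This only detects gaps made of words that have a query position. Extra words that never receive a query position would need a separate check.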
unknown words DO NOT exist anywhere in any RULE or LICENSE. They can be seen only in the Query.unknowns_by_pos where we only track how many unknown words exist after a known word position. They are not present in the ispan nor the qspan.
stopwords exist in RULEs and LICENSEs and are short, too common words to be useful. They are skipped both on the index and query side. They are not present in the ispan nor the qspan. They can be seen only in the Query.stopwords_by_pos where we only track how many stopwords exist after a known word position.
By construction, key_phrase_span should be :
Therefore, it should be possible to do key_phrase_span in match.qspan.
I reckon the code snippet above is for the next step.
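The bookkeeping described above for unknown words and stopwords can be sketched as follows. This is a toy model, not the ScanCode implementation: the function names, the sample dictionary and stopword sets, and the use of -1 for "before the first known word" are all assumptions.

```python
from collections import defaultdict


def index_query_tokens(tokens, dictionary, stopwords):
    """Return known tokens plus per-position counts of skipped words."""
    known = []
    unknowns_by_pos = defaultdict(int)
    stopwords_by_pos = defaultdict(int)
    pos = -1  # assumption: -1 stands for "before the first known word"
    for token in tokens:
        if token in stopwords:
            stopwords_by_pos[pos] += 1
        elif token in dictionary:
            pos += 1  # only known words get a position
            known.append(token)
        else:
            unknowns_by_pos[pos] += 1
    return known, dict(unknowns_by_pos), dict(stopwords_by_pos)


def unknowns_inside(qstart, qend, unknowns_by_pos):
    """Count unknown words strictly inside the inclusive range qstart..qend."""
    # Unknowns recorded after qend come after the range, so qend is excluded.
    return sum(unknowns_by_pos.get(p, 0) for p in range(qstart, qend))


tokens = "redistribution and use in foo source and binary forms".split()
dictionary = {"redistribution", "use", "source", "binary", "forms"}
stopwords = {"and", "in"}
known, unknowns, stops = index_query_tokens(tokens, dictionary, stopwords)
```

Here "foo" never gets a position of its own; it only shows up as a count of 1 after position 1, so a key phrase covering positions 1 to 2 looks contiguous by position alone and needs the extra unknowns_inside check.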