This is a channel focused on ScanCode support and not as noisy as the main discuss channel
I am having a bit of trouble understanding Query.unknowns_by_pos. As far as I can tell, the query tests/licensedcode/data/datadriven/external/fossology-tests/BSD/lz4.license.txt matches the rule src/licensedcode/data/rules/bsd-simplified_and_gpl-2.0_1.RULE exactly (apart from everything after line 7). Yet in the refine_matches phase (second iteration) it reports the following: unknowns_by_pos = defaultdict(<class 'int'>, {43: 0, 41: 0, 15: 0, 21: 0, 20: 0}). I am particularly surprised by 15, 20 and 21. And this is throwing off #2637.
I was under the impression that token will be considered unknown if token not in query.idx.dictionary? But it is not that simple?
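One possible explanation for the zero-valued entries, offered only as a guess since nothing here shows how refine_matches reads the mapping: a Python defaultdict inserts a key as soon as it is read with square brackets, so a position that was merely looked up ends up stored with a count of 0. The snippet below is plain Python and does not use any ScanCode code; the positions are taken from the output quoted above.

```python
from collections import defaultdict

unknowns_by_pos = defaultdict(int)
unknowns_by_pos[3] += 1  # one real unknown word after known position 3

# Reading a missing key with [] silently inserts it with the default value 0.
for pos in (15, 20, 21):
    _ = unknowns_by_pos[pos]

print(dict(unknowns_by_pos))  # {3: 1, 15: 0, 20: 0, 21: 0}

# .get() and "in" do not insert anything.
probe = defaultdict(int)
probe.get(15)
print(15 in probe, len(probe))  # False 0
```

If that is what is going on, an entry with a value of 0 would not mean the token at that position is unknown, only that the position was queried.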
I'm afraid I have another case that I'm not able to work out. Another one where the ispan is too inclusive: https://github.com/softsense/scancode-toolkit/blob/issue-2637-allow-license-rules-to-require-the-presence-of-certain-defining-keywords/tests/licensedcode/test_match.py#L325
The ispan of the match contains Span(2,22), but I feel it should be Span(2,4)|Span(7...) so that it does not include the key phrase of Span(2,8) ispan?
Ah, so as far as the ispan is concerned all the words are matching, it is not concerned about the extra words in there, that'd be for the qspan. So what is still missing from the key_phrase_filter is checking if key_phrase_span is uninterrupted in the qspan.
What if I use zip(qspan, ispan) to create query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span)) if ipos in key_phrase_span, that is, a Span offset to where the key phrase matches in the query, and then check query_key_phrase_span in qspan? Would that reasoning be in the right direction?
    for qpos, ipos in zip(match.qspan, match.ispan):
        if ipos in key_phrase_span:
            query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span))
            if query_key_phrase_span not in match.qspan:
                has_key_phrases = False
            break
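The "uninterrupted in the qspan" idea can also be sketched without Span at all, using plain lists of integer positions. This is only an illustration: key_phrase_is_contiguous and its parameters are hypothetical names, not ScanCode API, and it assumes qpositions and ipositions are aligned pairwise the way zip(match.qspan, match.ispan) is used above. The optional unknowns_by_pos argument is assumed to map a known query position to the number of unknown words that follow it.

```python
def key_phrase_is_contiguous(qpositions, ipositions, key_phrase_ipositions,
                             unknowns_by_pos=None):
    """
    Return True if every key phrase token is matched and the matched query
    positions are consecutive, with no unknown words in between.
    All names here are hypothetical; positions are plain ints, not Spans.
    """
    key = set(key_phrase_ipositions)
    # Query positions aligned with the key phrase positions of the rule.
    qpos_for_key = [q for q, i in zip(qpositions, ipositions) if i in key]
    if len(qpos_for_key) != len(key):
        return False  # at least one key phrase token was not matched
    # A gap means a known extra word sits inside the phrase on the query side.
    if any(b - a != 1 for a, b in zip(qpos_for_key, qpos_for_key[1:])):
        return False
    # Unknown words never get a position, so check the per-position counters.
    unknowns_by_pos = unknowns_by_pos or {}
    return all(unknowns_by_pos.get(q, 0) == 0 for q in qpos_for_key[:-1])


# Key phrase at rule positions 1..3, matched at query positions 1..3.
print(key_phrase_is_contiguous([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [1, 2, 3]))  # True
# A known extra word at query position 3 splits the phrase.
print(key_phrase_is_contiguous([0, 1, 2, 4, 5], [0, 1, 2, 3, 4], [1, 2, 3]))  # False
# An unknown word after query position 2 also splits the phrase.
print(key_phrase_is_contiguous([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [1, 2, 3], {2: 1}))  # False
```

Using .get() on the counters avoids inserting new keys if a defaultdict is passed in.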
unknown words DO NOT exist anywhere in any RULE or LICENSE. They can be seen only in the Query.unknowns_by_pos where we only track how many unknown words exist after a known word position. They are not present in the ispan nor the qspan.
stopwords exist in RULEs and LICENSEs and are short, too common words to be useful. They are skipped both on the index and query side. They are not present in the ispan nor the qspan. They can be seen only in the Query.stopwords_by_pos where we only track how many stopwords exist after a known word position.
By construction, key_phrase_span should be :
Therefore, it should be possible to do key_phrase_span in match.qspan.
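To make the bookkeeping described above concrete, here is a toy model of it. It is not the ScanCode implementation: split_tokens, the tiny dictionary and the stopword set are made up for illustration, and using -1 as the key for words seen before the first known token is an assumption of this sketch.

```python
from collections import defaultdict


def split_tokens(words, dictionary, stopwords):
    """
    Toy model: only known tokens get a position. Unknown words and stopwords
    are counted against the position of the last known token before them.
    """
    known = []
    unknowns_by_pos = defaultdict(int)
    stopwords_by_pos = defaultdict(int)
    pos = -1  # position of the last known token; -1 means none seen yet
    for word in words:
        if word in stopwords:
            stopwords_by_pos[pos] += 1
        elif word in dictionary:
            pos += 1
            known.append(word)
        else:
            unknowns_by_pos[pos] += 1
    return known, dict(unknowns_by_pos), dict(stopwords_by_pos)


words = "the quick foo license bar bar".split()
known, unknowns, stops = split_tokens(
    words, dictionary={"quick", "license"}, stopwords={"the"})
print(known)     # ['quick', 'license']
print(unknowns)  # {0: 1, 1: 2}
print(stops)     # {-1: 1}
```

In this model a span can only ever contain positions 0 and 1, which is why neither unknown words nor stopwords can show up in a qspan or an ispan.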
I reckon the code snippet above is for the next step.
Yes, that's what I was trying to communicate. So that under the {{Creative Commons Attribution 4.0 International License}} (the "License"); won't match under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The key_phrase_span is a single key phrase.
Yes, remember my comment on your PR... if you have only one rule in your test index, the universe of unknown words is very large :D
That was just a bad attempt to make my example easier to run, the same problem persists when running as a datadriven test (they use the full index right?). Hence why I am taking another look.
cc-by-nc-sa-4.0 match IMHO
--max-in-memory to use disk-caching...