This is a channel focused on ScanCode support and not as noisy as the main discuss channel
Ah, so as far as the `ispan` is concerned, all the words are matching; it is not concerned about the extra words in there, that'd be for the `qspan`. So what is still missing from the `key_phrase_filter` is checking whether `key_phrase_span` is uninterrupted in the `qspan`.
If I used `zip(qspan, ispan)` to create a `query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span))` whenever `ipos in key_phrase_span`, that would give me a `Span` offset by where it matches. And then I could check that with `query_key_phrase_span in qspan`. Would that reasoning be in the right direction?
for qpos, ipos in zip(match.qspan, match.ispan):
    if ipos in key_phrase_span:
        query_key_phrase_span = Span(qpos, qpos + len(key_phrase_span))
        if query_key_phrase_span not in match.qspan:
            has_key_phrases = False
            break
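The idea in the snippet above can be tried out without the real index. What follows is only a minimal sketch, not ScanCode's implementation: it uses plain lists of integer positions instead of `Span`, it assumes (as the question does) that zipping the matched query and index positions pairs them up one to one, and the function name `key_phrase_is_contiguous` is made up for illustration.

```python
def key_phrase_is_contiguous(qpositions, ipositions, key_phrase_ipositions):
    """
    Return True if every index position of the key phrase is matched and the
    corresponding query positions form one uninterrupted run.

    Hypothetical helper: plain position lists stand in for match.qspan,
    match.ispan and key_phrase_span.
    """
    # Map each matched index (rule) position to its aligned query position.
    qpos_by_ipos = dict(zip(ipositions, qpositions))

    # The key phrase must be fully matched on the index side.
    if any(ipos not in qpos_by_ipos for ipos in key_phrase_ipositions):
        return False

    # Query positions of the key phrase, in rule order.
    key_qpositions = [qpos_by_ipos[ipos] for ipos in sorted(key_phrase_ipositions)]

    # Uninterrupted means each query position directly follows the previous one.
    return all(b - a == 1 for a, b in zip(key_qpositions, key_qpositions[1:]))


# Key phrase at rule positions 1..3, matched at consecutive query positions.
contiguous = key_phrase_is_contiguous([10, 11, 12, 13, 14], [0, 1, 2, 3, 4], [1, 2, 3])

# Same rule, but the query has extra known words before the last key phrase word.
interrupted = key_phrase_is_contiguous([10, 11, 12, 15, 16], [0, 1, 2, 3, 4], [1, 2, 3])

# Rule position 2 of the key phrase was not matched at all.
partial = key_phrase_is_contiguous([10, 11, 13, 14], [0, 1, 3, 4], [1, 2, 3])
```

This only looks at known-word positions; unknown words and stopwords never show up in either span, so they would need a separate check.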
Unknown words DO NOT exist anywhere in any RULE or LICENSE. They can be seen only in `Query.unknowns_by_pos`, where we only track how many unknown words exist after a known word position. They are not present in the `ispan` nor the `qspan`.
Stopwords exist in RULEs and LICENSEs and are short words that are too common to be useful. They are skipped both on the index and query side. They are not present in the `ispan` nor the `qspan`. They can be seen only in `Query.stopwords_by_pos`, where we only track how many stopwords exist after a known word position.
By construction, `key_phrase_span` should be :
Therefore, it should be possible to do `key_phrase_span in match.qspan`.
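Since unknown words are only counted per known position, spotting one inside a key phrase comes down to a lookup in that mapping. Here is a rough sketch under stated assumptions: `unknowns_by_pos` is modelled as a plain dict from a query position to the number of unknown words seen right after it, the key phrase query positions are already known, and the function name `has_unknown_words_inside` is hypothetical. It covers unknown words only, not stopwords.

```python
def has_unknown_words_inside(key_qpositions, unknowns_by_pos):
    """
    Return True if at least one unknown word was counted right after any
    query position of the key phrase other than its last position.

    Hypothetical helper: `unknowns_by_pos` is a plain dict standing in for
    Query.unknowns_by_pos.
    """
    # Unknown words counted after the last position are outside the key phrase.
    inner_positions = sorted(key_qpositions)[:-1]
    return any(unknowns_by_pos.get(qpos, 0) > 0 for qpos in inner_positions)


# Two unknown words follow query position 12, in the middle of the key phrase.
inside = has_unknown_words_inside([11, 12, 13], {12: 2})

# One unknown word follows position 13, the end of the key phrase: no interruption.
after_end = has_unknown_words_inside([11, 12, 13], {13: 1})
```

With a test index that holds a single rule, most query words are unknown, so a lookup like this would fire far more often than against the full index.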
I reckon the code snippet above is for the next step.
Yes, that's what I was trying to communicate. So that `under the {{Creative Commons Attribution 4.0 International License}} (the "License");` won't match `under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License`.
The `key_phrase_span` is a single key phrase.
Yes, remember my comment on your PR... if you have only one rule in your test index, the universe of unknown words is very large :D
That was just a bad attempt to make my example easier to run; the same problem persists when running as a data-driven test (they use the full index, right?). Hence why I am taking another look.
`cc-by-nc-sa-4.0` match IMHO
`--max-in-memory` to use disk-caching...