Philippe Ombredanne
@pombredanne
BTW do you need to keep track of a complete span for a keyphrase? Or rather, maybe you need only its start position and its length? And maybe a Rule.keyphrases could be a mapping of {start_position: length}?
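(A rough, simplified sketch of that idea, purely illustrative and not the actual scancode-toolkit code:)

# Illustrative only: keep keyphrase positions on a rule as {start: length}
# (rule token positions, e.g. built from the {{...}} markers in the rule text)
# and check that a candidate match covers all of them.
class Rule:
    def __init__(self, text, keyphrases=None):
        self.text = text
        # mapping of {start_position: length}
        self.keyphrases = keyphrases or {}

def covers_all_keyphrases(rule, match_start, match_end):
    return all(
        match_start <= start and start + length - 1 <= match_end
        for start, length in rule.keyphrases.items()
    )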

Yes, that's what I was trying to communicate. So that 'under the {{Creative Commons Attribution 4.0 International License}} (the "License");' won't match under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

yes, remember my comment on your PR... if you have only one rule in your test index, the universe of unknown words is very large :D

Mike Rombout
@mrombout

That was just a bad attempt to make my example easier to run; the same problem persists when running as a data-driven test (they use the full index, right?). Hence I am taking another look.

https://dev.azure.com/nexB/scancode-toolkit/_build/results?buildId=5135&view=logs&jobId=58aae7ae-0fb4-5f3e-3a0c-bb9f2080987e&j=58aae7ae-0fb4-5f3e-3a0c-bb9f2080987e&t=8cf9f4a1-3731-59c7-8acb-d9b49562ab2c

Philippe Ombredanne
@pombredanne
checking the test and matched rule texts...
You should expect a single cc-by-nc-sa-4.0 match IMHO
Philippe Ombredanne
@pombredanne
BTW this other thing from Amazon is totally fubar https://github.com/awsdocs/aws-net-developer-guide
it is both under cc-by-sa and cc-by-nc-sa which are completely different puppies
@mrombout In your failing test above you are likely missing a rule
Your new filtering code ... does filter! and it exposes that we are missing some rules there
(which makes me think of a new refinement.... IMHO you might want to avoid filtering later not only when there is only one match, but also when there is only one match in a given region, possibly a query run? ... just a thought)
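(A rough sketch of that refinement, purely illustrative; scancode's real match filtering is more involved, and query_run_id / covers_all_keyphrases are made-up names here:)

from collections import defaultdict

def filter_keyphrase_misses(matches):
    # Only drop keyphrase-missing matches in a region (query run) that has
    # more than one candidate match; a lone match in a run is kept as-is.
    by_run = defaultdict(list)
    for match in matches:
        by_run[match.query_run_id].append(match)

    kept = []
    for run_matches in by_run.values():
        if len(run_matches) == 1:
            kept.extend(run_matches)
        else:
            kept.extend(m for m in run_matches if m.covers_all_keyphrases())
    return kept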
Philippe Ombredanne
@pombredanne
@mrombout why have you closed your PR at softsense/scancode-toolkit#1 ? also do not forget to merge in softsense/scancode-toolkit#2 IMHO
Mike Rombout
@mrombout
I interactively rebased, so softsense/scancode-toolkit#2 couldn't merge; I just cherry-picked it into my branch and created nexB/scancode-toolkit#2773. I am hoping to be done with this soon as I'm running over budget ;)
Philippe Ombredanne
@pombredanne
ah, you moved it there... nexB/scancode-toolkit#2773 :+1: and merged @dd-jy's PR :+1:
excellent
Aditya Sangave
@adii21-Ux
Hello @pombredanne, I am trying to work on nexB/scancode-toolkit#2767. Correct me if I am wrong: I have to improve the license rule for the given license, which is lgpl2.0, right?
Philippe Ombredanne
@pombredanne
@adii21-Ux hey, yes, the resolution could go a few different ways!
I added a few comments in https://github.com/nexB/scancode-toolkit/issues/2767#issuecomment-987735258
Aditya Sangave
@adii21-Ux
hello @pombredanne, I added a comment under nexB/scancode-toolkit#2767, please check it out
Salt
@salt:sal.td [m]
Yay, finally, the correct place to ask ScanCode questions. I'm pretty excited about this project and am incorporating it into two research projects that are dealing with license detection. However, I'm running into memory issues and it keeps getting chomped by the OOM killer. Could someone suggest a process to identify where the issue is taking place? I've already tried setting --max-in-memory to use disk-caching...
Salt
@salt:sal.td [m]
Or perhaps there's some way to detect when it is killed and restart the scan from that point forward?
Salt
@salt:sal.td [m]
bleh, the reaper keeps coming. Most scancode runs sit at ~15% memory, but then it spikes to 95%+ and gets killed.
Philippe Ombredanne
@pombredanne
@salt:sal.td hey :wave:
How much ram do you have on hand?
it uses roughly one GB per process (and this is mostly static usage for the index, but unfortunately not memory-mapped hence not shared between processes)
and it then needs RAM to assemble the final output
This part may be the most memory hungry
which output format do you use?
the jsonlines output has been designed for a smaller footprint, as things do not need to all be loaded in memory to create the output
Can you paste your scan cli args details?
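(For illustration, a run such as scancode -l --json-lines scan.jsonl /path/to/code writes one JSON object per line, so the output can also be post-processed as a stream, for example:)

import json

# Illustrative: each line of a JSON Lines scan is a standalone JSON object
# (the exact keys depend on the scancode version), so results can be handled
# one line at a time without holding the whole scan in memory.
with open("scan.jsonl", encoding="utf-8") as stream:
    for line in stream:
        if line.strip():
            record = json.loads(line)
            # e.g. index the record, count licenses, etc.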
Philippe Ombredanne
@pombredanne
@salt:sal.td I am intrigued by your research projects too! tell me more :)
Salt
@salt:sal.td [m]
@pombredanne: must be the final output that is crashing things then. I'm using json-pp and sending that to elasticsearch. I keep upping the RAM on the virtual machine, but it's crashing out at 6 GB. Will give CLI details and such later, in a meeting but wanted to respond :)
Philippe Ombredanne
@pombredanne
@salt:sal.td sure thing. If you want you could file an issue so we can track this in details there
Abhishek Kumar
@Abhishek-Dev09
@/all 👋
Philippe Ombredanne
@pombredanne
@Abhishek-Dev09 hey :wave:
'sup?
Abhishek Kumar
@Abhishek-Dev09
@pombredanne Hi, how is it going?
Ayan Sinha Mahapatra
@AyanSinhaMahapatra
:wave:
Philippe Ombredanne
@pombredanne
@Abhishek-Dev09 doing great and you?
dwdanielo
@dwdanielo
Hello everyone,
sorry if this is not the right place to ask this, but I'm having some troubles with scancode while excluding some files. I've tried both --ignore-author and --ignore-copyright-holder like in this example:
https://scancode-toolkit.readthedocs.io/en/latest/cli-reference/output-filters-and-control.html?highlight=ignore
but none of the above worked for me.
How can I skip one folder in my repository? Is there a way to do that by adding a parameter when executing the command, or should I use a separate config file to specify the particular directories I want to scan? I'll just add that it's not about skipping the unwanted files in post-scan activities; I need to skip the unwanted folder from scanning, as it will save 10h(!) of my time.
Thanks in advance, any help will be appreciated!
Philippe Ombredanne
@pombredanne
@dwdanielo the --ignore "glob pattern" option should be what you need
@dwdanielo this is the right place BTW... and welcome! :wave:
@dwdanielo the doc surely could be improved... so please come back here to tell if this worked for you
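(For illustration only: the option expects a plain glob rather than a regex, so depending on your layout something like scancode --ignore "UNWANTED" -l --html C:/scan_log.html C:/workspace, or a pattern such as "*UNWANTED*", is the kind of form to try; the exact pattern may need adjusting.)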
dwdanielo
@dwdanielo

Thanks so much for the answer :) Unfortunately I still can't exclude the unwanted folder. I generated the glob pattern like in the photo below using https://regex101.com/, but nothing changed; scancode scanned all the content from the path.
Just to show better what I've tried, here is my structure:

C/
├─ workspace/
│  ├─ UNWANTED/
│  ├─ folder_1/
│  ├─ folder_2/
│  ├─ folder_n/
│  ├─ file_1
│  ├─ file_2
│  ├─ file_n

and the command:
C:\workspace>scancode --ignore "./UNWANTED/." -l --html C:/scan_log.html C:/workspace

I've also tried --ignore "./UNWANTED/." as the last parameter, and with an r before the glob pattern, but nothing changed...
Maybe I'm missing some basic stuff?

[attached image: image.png]
Philippe Ombredanne
@pombredanne
@dwdanielo let me try this locally
dwdanielo
@dwdanielo
I have dealt with it a different way: I created a Python script that scans every folder separately (except UNWANTED). The result is that the script generates many reports, but then I merge every HTML report into one. Just wanted to share my solution ;)
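(A rough sketch of that kind of per-folder driver script, with made-up paths and the HTML merging step left out:)

import subprocess
from pathlib import Path

WORKSPACE = Path("C:/workspace")
SKIP = {"UNWANTED"}

# Run a separate scancode license scan per top-level folder, skipping the
# unwanted ones, and write one HTML report per folder.
for folder in WORKSPACE.iterdir():
    if folder.is_dir() and folder.name not in SKIP:
        report = WORKSPACE / f"scan_{folder.name}.html"
        subprocess.run(
            ["scancode", "-l", "--html", str(report), str(folder)],
            check=True,
        )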
Philippe Ombredanne
@pombredanne
@dwdanielo ah :) thanks!
Aditya Sangave
@adii21-Ux
@pombredanne I added a comment on nexB/scancode-toolkit#2872
Rahul Surwade
@RahulSurwade08
Hey Everyone!
My name is Rahul Surwade. I am a Cloud Security Engineer by profession with knowledge of AWS, GCP, Linux, Docker, Terraform and Kubernetes. I am new to open source contribution. I am very excited to contribute to your GSoC 2022 projects and learn a lot during my tenure here. I am currently working my way towards learning Go and GitOps.
I am interested in contributing to the container-inspector project.
Please feel free to reach out on LinkedIn: www.linkedin.com/in/rahul-surwade
Philippe Ombredanne
@pombredanne
@RahulSurwade08 welcome!