I think these steps should take care of the aforementioned points:
1) Prepare list of files/dirs to be checked
2) Find documented files/dirs:
for about in about_files:
    documented.append(about["about_resource"])
3) Find partially documented directories
4) Find undocumented directories
5) Find undocumented files
6) Return results
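A skeleton of these steps in Python (the function shape and the structure of `about_files` are my assumptions; only step 2 is fleshed out, matching the snippet above):

```python
def find_documentation_status(all_paths, about_files):
    """Classify the prepared path list (skeleton sketch).

    all_paths: every file/dir in the project (step 1, prepared by the caller).
    about_files: parsed .ABOUT data; each entry is assumed to be a dict
    with an "about_resource" key.
    """
    # Step 2: find documented files/dirs
    documented = []
    for about in about_files:
        documented.append(about["about_resource"])

    # Steps 3-5 (to be implemented): find partially documented dirs,
    # undocumented dirs (topmost only), and undocumented files.
    undocumented = [p for p in all_paths if p not in documented]

    # Step 6: return results
    return documented, undocumented
```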
`spdx_license_expression`, which will contain an SPDX license expression. I also wanted to ask about the code refactoring and performance enhancements part of the project. What exactly does this involve?
For these, if you take a closer look at the code, there are many "FIXME" items and maybe some redundant code. Although the code works, the implementation or algorithm can be improved and needs some attention.
Do I need to specify in my proposal all the FIXME items I plan to change or improve, or can I leave the code refactoring for now?
@srthkdb I am not sure about the partially documented part. Whether the directory is undocumented or partially documented, we still need to return the part that is not documented. What is a use case where "partially documented" is needed?
That was just a way to find undocumented files. That way, I will know which directories' files I have to loop through to return as undocumented. Finally, I will return only documented and undocumented
The general idea for nexB/aboutcode-toolkit#413 is that users may have their own OSS component usage input (a CSV spreadsheet) for a project/product, in which they may use SPDX license values.
We want to encourage users to use AboutCode TK without much effort to change their input, and that's why we have an idea to provide a tool to convert SPDX licenses to ScanCode licenses to work with AboutCode TK.
So basically, the user will provide an SPDX expression as input and the tool will generate the corresponding ScanCode license expression?
@srthkdb I think you should begin to work on and improve your proposal, but of course, you are welcome to do some coding as well.
> That was just a way to find undocumented files. That way, I will know which directories' files I have to loop through to return as undocumented. Finally, I will return only documented and undocumented
Right. That's why I am uncertain about the point of identifying partially documented dirs. I.e., the tool loops through the project; when it sees an ABOUT file referencing a dir, then that dir is documented and the tool will break the loop and keep going with the other dirs. If a file is in a dir which no ABOUT file is referencing, then the tool will check if the files in that dir have been documented; if not, then it returns the undocumented files (if all files in a dir are undocumented, then it returns the dir as undocumented).
In other words, I am uncertain about the necessity of knowing whether a dir is partially documented or not...
Maybe I am missing something; a use case sample would be best to explain :D
> So basically, the user will provide an SPDX expression as input and the tool will generate the corresponding ScanCode license expression?
SPDX refers to https://spdx.org/. Some people/companies are already using SPDX licenses to document their software packages.
SPDX License List: https://spdx.org/licenses/
Our goal here is to create a tool to automatically convert SPDX licenses/identifiers (in a CSV/JSON) to match our ScanCode licenses.
For instance,
Apache-1.1 -> apache-1.1
BSD-2-Clause -> bsd-simplified
etc...
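A minimal sketch of such a converter, assuming a hand-maintained mapping table (the real tool would presumably derive this from ScanCode's license data; the dict below covers only the two examples above):

```python
# Hypothetical SPDX-to-ScanCode mapping; only illustrative entries.
SPDX_TO_SCANCODE = {
    "Apache-1.1": "apache-1.1",
    "BSD-2-Clause": "bsd-simplified",
}

def spdx_to_scancode(spdx_id):
    """Return the ScanCode license key for an SPDX identifier.

    Falls back to the lowercased identifier when the mapping has no
    entry, since many ScanCode keys are simply the lowercase SPDX id.
    """
    return SPDX_TO_SCANCODE.get(spdx_id, spdx_id.lower())
```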
@chinyeungli The tool will first find all documented files/dirs, and use that to find partially documented dirs as defined above. Now, all directories which are in neither the documented nor the partially documented list will have all their files undocumented; in those cases, we return the topmost directories. This way, we get all the undocumented directories. Then, for the partially documented dirs, we loop through their files and return the undocumented files. This way, we get all the undocumented files.
For example, consider
/project/hello.c
/project/hello1.java
/project/dir1
/project/dir1/foo.c
/project/dir1/dir1.ABOUT
/project/dir2/abc.java
/project/dir2/aaa.c
/project/dir2/aaa.ABOUT
/project/dir3/example.c
/project/dir3/example.java
/project/dir3/dir4
Now, documented = [/project/dir1/, /project/dir2/aaa.c]
using documented, we get
partially_documented = [/project/, /project/dir2/]
Using partially_documented and documented, we get undocumented directories (present in neither):
undocumented = [/project/dir3/] (dir4 is also such a directory, but we already included its parent dir)
Now, we loop through the partially documented dirs and get
undocumented = [/project/dir3/, /project/hello.c, /project/hello1.java, /project/dir2/abc.java]
One more thing which needs to be taken care of: while looping through the partially_documented dirs, if we find no undocumented files, then we move the partially documented dir to the documented list.
E.g., if /project/dir1/dir1.ABOUT documented only foo.c, then /project/dir1/ would be included in the partially documented list, even though it is a fully documented directory.
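The walk-through above can be turned into a small runnable sketch (the function name, data shapes, and plain-string paths are my assumptions for illustration, not the toolkit's actual API):

```python
def find_undocumented(files, dirs, documented):
    """Return topmost fully-undocumented dirs plus remaining undocumented files.

    files: every file in the project (plain path strings).
    dirs: every directory in the project.
    documented: entries referenced by ABOUT files; directory entries
    end with a trailing '/'.
    """
    doc_dirs = [d.rstrip("/") for d in documented if d.endswith("/")]
    doc_files = [f for f in documented if not f.endswith("/")]

    def covered(path):
        # Documented directly, or inside a documented directory.
        return path in doc_files or any(
            path == d or path.startswith(d + "/") for d in doc_dirs)

    # ABOUT files themselves never need documenting.
    pending = [f for f in files
               if not f.endswith(".ABOUT") and not covered(f)]

    def fully_undocumented(d):
        return not covered(d) and not any(
            covered(f) for f in files if f.startswith(d + "/"))

    undocumented = []
    for d in sorted(dirs):  # parents sort before their children
        if fully_undocumented(d) and not any(
                d.startswith(u + "/") for u in undocumented):
            undocumented.append(d)  # report only the topmost dir

    # Files not already covered by a reported undocumented directory.
    for f in pending:
        if not any(f.startswith(u + "/") for u in undocumented):
            undocumented.append(f)
    return undocumented
```

Running this on the example tree above yields /project/dir3 as a whole plus the three leftover files, matching the expected result in the discussion.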
@srthkdb Ah! I understand what you mean now. Well, the logic here for transform is to transform the data into what we want so that it works as an input for AboutCode TK, and therefore it checks if the input contains the necessary required fields for AboutCode TK. In other words, the transform will give an error if the essential fields are not present, because the transformed data would not work as input for AboutCode TK.
Let me put it this way: users want to use the transform because they want to use AboutCode TK, so if the transformed data doesn't work, it may not make sense... but I guess you can argue about that.
@srthkdb what do you think about this? Do you think we should get rid of this early check? Feel free to express your point, or perhaps open an issue ticket for others to discuss.
This point about `transform` came up earlier as well: whether we should generalize `transform` for all JSON/CSV, or design it specifically for JSON/CSV files from ScanCode TK and AboutCode TK only. We concluded to keep it limited to output from the ScanCode and AboutCode TKs. Hence, I feel we should keep this uniformity. If we still wish for `transform` to output files without essential fields, we can show a warning in the terminal that the output will not work as input for AboutCode TK.
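The warning behaviour discussed here could look roughly like this (a sketch under my assumptions; the real validation in AboutCode TK's transform may differ):

```python
import sys

# Default essential fields for AboutCode TK input, per the discussion above.
ESSENTIAL_FIELDS = ("about_resource", "name")

def warn_missing_essentials(rows, essential=ESSENTIAL_FIELDS):
    """Warn (rather than fail) when transformed rows lack essential fields.

    Returns the list of missing field names so the caller can decide
    whether to abort or continue.
    """
    present = set(rows[0]) if rows else set()
    missing = [f for f in essential if f not in present]
    if missing:
        print("WARNING: transformed output is missing essential fields "
              f"{missing}; it will not work as input for AboutCode TK.",
              file=sys.stderr)
    return missing
```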
As @chinyeungli mentioned:
"First of all, I think we need to spend some time on the objective/design.
i.e. for the JSON input, do we take all JSON format files as input, OR should we be very strict to only include JSON files generated from AboutCode TK and ScanCode TK?"
I think that, being part of the aboutcode package, it makes more sense to design our input strictly for JSON files generated from AboutCode TK and ScanCode TK, because:
1) Easy to use: We can keep things simpler and faster for users, because if we generalize this for all JSON inputs, we will need the user to input the parent fields as arguments; otherwise, the process of filtering and checking essential columns will become very complicated (as far as I can see). But if we design it particularly for ScanCode, for instance, we know that we need to look inside "files". This will make the input much easier on the user's side. For example, instead of writing something like
column_filters: - for "files" - "type"
the user can just write
column_filters: - "type"
2) Include specific features aimed at these particular files: Generalizing will also remove built-in specific features such as checking for the "about_resource" and "name" fields, and renaming "path" to "about_resource".
3) Code efficiency and performance: Moreover, most of the time transform will be used only for JSON files generated from AboutCode TK and ScanCode TK; hence, by making it work strictly on these files, we can write more efficient code as compared to the generalized case.
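Point 1 can be illustrated with a sketch: because ScanCode JSON is known to keep its per-file records under a top-level "files" key, a filter like `column_filters: - "type"` can be applied without the user naming the parent field (the function and config shape here are hypothetical):

```python
def apply_column_filters(scancode_data, column_filters):
    """Keep only the requested columns from each per-file record.

    Assumes the ScanCode JSON structure: per-file records live under
    the top-level "files" key, so the user never names that parent.
    """
    return [{k: v for k, v in rec.items() if k in column_filters}
            for rec in scancode_data.get("files", [])]
```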
These were the points we discussed. I thought you meant that the input to `transform` should have essential fields, and that would have been a bug. I was talking about the output of `transform`, and I thought you were saying it would have been better if it did not perform those checks on the output. Now that I've understood, I personally feel that those warnings are helpful for the user, because the `transform` command will be used only to produce input for AboutCode TK.
`transform` currently supports `required_fields`, which the user can provide in the configuration file; those additional fields will be treated as essential in addition to the default essential fields, namely `about_resource` and `name`. This allows `transform` to treat `about_resource` and `name` as essential fields by default, without the user specifying anything.
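The behaviour described here might be sketched as follows (field names per the discussion above; the configuration handling is my assumption):

```python
# Default essential fields, always required regardless of user config.
DEFAULT_ESSENTIAL = ("about_resource", "name")

def effective_required_fields(config):
    """Combine user-supplied required_fields with the default essentials.

    The defaults are always treated as required, whether or not the
    user lists them in the configuration file.
    """
    user_fields = config.get("required_fields", [])
    required = list(DEFAULT_ESSENTIAL)
    for f in user_fields:
        if f not in required:
            required.append(f)
    return required
```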