    Chin Yeung
    @chinyeungli
    Something you also need to think about:
    • If none of the files in a directory are covered, we only want to return the directory reference instead of all files under the directory.
    • If all files in a directory fall under the .aboutignore (as you propose), such as a directory that only contains .html and .jpg files, we don't want to return any undocumented warning for this directory
    Chin Yeung
    @chinyeungli
    and btw, we already have code to get all the ABOUT files in util.py
    Sarthak
    @srthkdb
    @chinyeungli Thanks for the feedback! I have noted the points for command line input, output and .aboutignore.

    I think these steps should take care of the aforementioned points:

    ----------Prepare list of files/dirs to be checked------------------------------

    1. Get files and dirs in root using get_locations() and get_about_locations() in util.py
    2. Filter files and dirs in .aboutignore
    3. If no files/dirs, return no warning. This takes care when all files in a directory fall under the .aboutignore
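
A minimal sketch of steps 2 and 3 above. It assumes a hypothetical .aboutignore format of one glob pattern per line (the file is only proposed in this discussion) and works on a plain list of paths rather than on the output of the util.py helpers. `filter_ignored` is an illustrative name, not an existing function.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def filter_ignored(paths, patterns):
    """Return the paths whose file name matches none of the ignore patterns."""
    kept = []
    for path in paths:
        name = PurePosixPath(path).name
        # fnmatchcase keeps the match case-sensitive on every platform.
        if not any(fnmatchcase(name, pattern) for pattern in patterns):
            kept.append(path)
    return kept


# Assumed .aboutignore content: one glob pattern per line.
patterns = ["*.html", "*.jpg"]
paths = [
    "/project/docs/index.html",
    "/project/docs/logo.jpg",
    "/project/hello.c",
]
remaining = filter_ignored(paths, patterns)

# A directory holding only ignored files leaves nothing to check, so no warning.
only_ignored = filter_ignored(["/project/docs/index.html"], patterns)
```

When the filtered list is empty, the caller can return early without emitting any undocumented warning.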

    --------Find documented files/dirs----------------------------------------------

    1. List all documented files and dirs. Something like:
      for about in about_files:
          documented.append(about["about_resource"])

    ----------Find partially documented directories------------------------------

    1. Find partially_documented directories inside and including root. Partially_documented dir is a dir which is not itself documented but has documented children/child. These dirs can be found by including the parent directories of documented files/dirs within the root, which are not in the documented list. Partially_documented can be a 'set' to avoid duplicates.

    ---------Find undocumented directories---------------------------------------

    1. In order of hierarchy (starting from root), if dir is not in documented or partially_documented, add the dir to undocumented, and skip all its child dirs. Another approach could be to add all directories and later remove child directories. In either case, we do not want to include children of an undocumented dir in the list.

    --------Find undocumented files-------------------------------------------------

    1. Add undocumented files in each partially_documented dir to the undocumented list.

    ------Return results------------------------------------------------------------------

    If a directory is not documented, but it contains a subdirectory which is documented, should it be shown as partially documented or undocumented in the results? I mean, is there a need to differentiate between partially documented and undocumented directories in the results?
    Sarthak
    @srthkdb
    Also, I don't understand what exactly the utility/option to convert SPDX license keys to scancode/aboutcode keys will do (nexB/aboutcode-toolkit#413).
    How will the user provide input?
    What I understand so far is we will include a field spdx_license_expression which will contain an SPDX license expression.
    Will this field be optional or will it always be provided?
    Also, will we deal with licenses not known to SPDX in the same way ScanCode does (using LicenseRef)?
    I also have a very basic doubt, which I cannot find the answer to online. Are SPDX keys and ids the same things or different?
    Sarthak
    @srthkdb
    Are we trying to generate a field spdx_license_expression as a companion to license_expressions? Or is it something like the user will provide SPDX identifiers as input and the utility will convert it into scancode/aboutcode key ?
    Sarthak
    @srthkdb
    I also wanted to ask about the code refactoring and performance enhancements part of the project. What exactly does this involve?
    Michael Herzog
    @mjherzog
    @srthkdb SPDX license identifiers and keys are the same thing. In general we would probably want both license_expression and spdx_license_expression because we cross-reference these - they only very rarely have the same value
    Chin Yeung
    @chinyeungli
    @srthkdb I am not sure about the partially documented part. Whether the directory is undocumented or partially documented, we still need to return the part that is not documented. What is a use case where partially documented is needed?
    Chin Yeung
    @chinyeungli
    The general idea for nexB/aboutcode-toolkit#413 is that users may have their own OSS component usage input (spreadsheet CSV) for a project/product in which they may use SPDX license values.
    We want to encourage users to use AboutCode TK without much effort to change their input, and that's why we have an idea to provide a tool to convert SPDX licenses to scancode's licenses to work with AboutCode TK

    I also wanted to ask about the code refactoring and performance enhancements part of the project. What exactly does this involve?

    For these, if you take a closer look at the code, there are many "FIXME" items and maybe some redundant code. Although the code works, the implementation or algorithm can be improved and needs some attention.

    Sarthak
    @srthkdb
    @chinyeungli @mjherzog Thanks!

    For these, if you take a closer look at the code, there are many "FIXME" items and maybe some redundant code. Although the code works, the implementation or algorithm can be improved and needs some attention.

    Do I need to specify in my proposal which FIXME items I will change or improve, or can I leave the code refactoring for now?

    Sarthak
    @srthkdb

    @srthkdb I am not sure about the partially documented part. Whether the directory is undocumented or partially documented, we still need to return the part that is not documented. What is a use case where partially documented is needed?

    That was just a way to find undocumented files. That way, I will know which directories' files I have to loop through to return as undocumented. Finally, I will return only documented and undocumented

    Sarthak
    @srthkdb

    The general idea for nexB/aboutcode-toolkit#413 is that users may have their own OSS component usage input (spreadsheet CSV) for a project/product in which they may use SPDX license values.
    We want to encourage users to use AboutCode TK without much effort to change their input, and that's why we have an idea to provide a tool to convert SPDX licenses to scancode's licenses to work with AboutCode TK

    So basically, the user will provide an SPDX expression as input and the tool will generate the corresponding ScanCode license expression?

    Sarthak
    @srthkdb
    I can't come up with a solution to the SPDX issue. I am new to open source and licenses. It would be really helpful if you could point me to some resources so that I can better understand what exactly I have to do?
    Chin Yeung
    @chinyeungli

    @srthkdb I think you should begin to work/improve your proposal, but of course, you are welcome to do some coding as well..

    That was just a way to find undocumented files. That way, I will know which directories' files I have to loop through to return as undocumented. Finally, I will return only documented and undocumented

    Right. That's why I am uncertain about the point of identifying partially documented dirs. I.e. the tool loops through the project; when it sees an ABOUT file referencing a dir, then that dir is documented and the tool will break the loop and keep going with other dirs. If it's in a dir which no ABOUT file is referencing, then it will check if files in that dir have been documented; if not, then return the undocumented files (if all files in a dir are undocumented, then return the dir as undocumented)

    In other words, I am uncertain whether it is necessary to know if a dir is partially documented or not...

    Maybe I am missing something, a use case sample will be best to explain :D
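
A rough sketch of the single-pass walk described above, with no separate notion of partially documented dirs. It assumes the project tree is given as a nested dict (a dict for a directory, None for a file) and that `documented` holds the paths referenced by ABOUT files; `scan` and both data shapes are illustrative, not toolkit code. ABOUT files themselves are skipped, which is also an assumption.

```python
def scan(path, tree, documented):
    """Return (all_undocumented, undocumented_paths) for the directory at `path`."""
    if path in documented:
        # An ABOUT file references this dir: stop here and move on.
        return False, []
    items = []
    all_undocumented = True
    for name, child in tree.items():
        child_path = f"{path}/{name}"
        if child is None:  # a file
            if name.endswith(".ABOUT"):
                continue  # assumption: ABOUT files are never reported
            if child_path in documented:
                all_undocumented = False
            else:
                items.append(child_path)
        else:  # a sub-directory
            child_all, child_items = scan(child_path, child, documented)
            all_undocumented = all_undocumented and child_all
            items.extend(child_items)
    if all_undocumented:
        # Nothing below this dir is documented: report the dir only.
        return True, [path]
    return False, items


tree = {
    "hello.c": None,
    "dir1": {"foo.c": None, "dir1.ABOUT": None},
    "dir3": {"example.c": None, "dir4": {}},
}
documented = {"/project/dir1"}
_, undocumented = scan("/project", tree, documented)
```

Here `undocumented` holds `/project/hello.c` and `/project/dir3`; the children of `dir3` are collapsed into the directory reference.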

    Chin Yeung
    @chinyeungli

    So basically, the user will provide an SPDX expression as input and the tool will generate the corresponding ScanCode license expression?

    SPDX refers to https://spdx.org/
    some people/companies are already using the SPDX license to document their software packages.
    SPDX License: https://spdx.org/licenses/

    Our goal here is to create a tool to automatically convert SPDX licenses/identifiers (in a CSV/JSON) to match our scancode licenses
    For instance,
    Apache-1.1 -> apache-1.1
    BSD-2-Clause -> bsd-simplified
    etc...
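
A small sketch of the key conversion, under loud assumptions: the mapping table contains only the two pairs quoted above (a real table would have to be built from ScanCode's license data), and `convert_expression` is a hypothetical helper, not part of AboutCode TK. Operators and parentheses are passed through unchanged, and unknown identifiers are reported instead of guessed.

```python
import re

# Only the two pairs from the discussion; NOT a complete mapping.
SPDX_TO_SCANCODE = {
    "Apache-1.1": "apache-1.1",
    "BSD-2-Clause": "bsd-simplified",
}
OPERATORS = {"AND", "OR", "WITH"}


def convert_expression(expression, mapping=SPDX_TO_SCANCODE):
    """Return (converted_expression, unknown_identifiers)."""
    # Split into parentheses and whitespace-separated words.
    tokens = re.findall(r"[()]|[^\s()]+", expression)
    converted = []
    unknown = []
    for token in tokens:
        if token in ("(", ")") or token in OPERATORS:
            converted.append(token)
        elif token in mapping:
            converted.append(mapping[token])
        else:
            # Keep the identifier as-is and let the caller decide what to do.
            unknown.append(token)
            converted.append(token)
    text = " ".join(converted).replace("( ", "(").replace(" )", ")")
    return text, unknown


simple = convert_expression("Apache-1.1 OR BSD-2-Clause")
nested = convert_expression("(Apache-1.1 OR BSD-2-Clause) AND MIT")
```

With the two-entry table, `MIT` comes back in the unknown list, which is where a LicenseRef-style fallback could be plugged in.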

    Sarthak
    @srthkdb

    @chinyeungli The tool will first find all documented files/dirs and, using them, will find partially documented dirs as defined above. Now, all directories which are in neither the documented nor the partially documented list will have all their files undocumented. So we return the topmost directories in those cases. This way, we get all the undocumented directories. Now, for partially documented dirs, we loop through their files and return the undocumented files. This way, we get all the undocumented files.

    For example, consider
    /project/hello.c
    /project/hello1.java
    /project/dir1
    /project/dir1/foo.c
    /project/dir1/dir1.ABOUT
    /project/dir2/abc.java
    /project/dir2/aaa.c
    /project/dir2/aaa.ABOUT
    /project/dir3/example.c
    /project/dir3/example.java
    /project/dir3/dir4

    Now, documented = [/project/dir1/, /project/dir2/aaa.c]
    using documented, we get
    partially_documented = [/project/, /project/dir2/]
    Using partially_documented and documented, we get undocumented directories (present in neither)
    undocumented = [/project/dir3/] (dir4 is also in the list, but we already included its parent dir)
    Now, we loop through partially documented dirs and get
    undocumented = [/project/dir3/, /project/hello.c, /project/hello1.java, /project/dir2/abc.java]

    One more thing which needs to be taken care of: while looping through partially_documented dirs, if we find no undocumented files, then we move the partially documented dir to the documented list.
    E.g., if /project/dir1/dir1.ABOUT documented only foo.c, then /project/dir1/ would be included in the partially documented list, but it is a documented directory.
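
The example above can be run through a short sketch of the described steps. Assumptions: the project is given as plain lists of dir and file paths, `documented` is the set of `about_resource` paths, ABOUT files are never reported, and `find_undocumented` is an illustrative name rather than toolkit code.

```python
from pathlib import PurePosixPath


def find_undocumented(root, dirs, files, documented):
    root = PurePosixPath(root)
    documented = {PurePosixPath(p) for p in documented}

    def covered(path):
        # Documented directly, or inside a documented directory.
        return path in documented or any(a in documented for a in path.parents)

    # Ancestors (within root) of documented resources that are not covered.
    partially = set()
    for resource in documented:
        for parent in resource.parents:
            inside_root = parent == root or root in parent.parents
            if inside_root and not covered(parent):
                partially.add(parent)

    # Topmost dirs that are neither covered nor partially documented.
    undocumented = []
    for d in sorted(map(PurePosixPath, dirs), key=lambda p: len(p.parts)):
        if covered(d) or d in partially:
            continue
        if any(a in undocumented for a in d.parents):
            continue  # parent already reported
        undocumented.append(d)

    # Undocumented files directly inside partially documented dirs.
    for f in map(PurePosixPath, files):
        if f.suffix.upper() == ".ABOUT":
            continue
        if f.parent in partially and not covered(f):
            undocumented.append(f)
    return [str(p) for p in undocumented]


dirs = ["/project", "/project/dir1", "/project/dir2",
        "/project/dir3", "/project/dir3/dir4"]
files = ["/project/hello.c", "/project/hello1.java",
         "/project/dir1/foo.c", "/project/dir1/dir1.ABOUT",
         "/project/dir2/abc.java", "/project/dir2/aaa.c",
         "/project/dir2/aaa.ABOUT",
         "/project/dir3/example.c", "/project/dir3/example.java"]
documented = {"/project/dir1", "/project/dir2/aaa.c"}

result = find_undocumented("/project", dirs, files, documented)
```

The result matches the list worked out by hand: `/project/dir3`, `/project/hello.c`, `/project/hello1.java` and `/project/dir2/abc.java`. Promoting a partially documented dir with no undocumented files to documented does not change this list, so the sketch leaves that step out.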

    @chinyeungli So all I need to do is just map the keys. Will I use Dejacode api for this?
    Sarthak
    @srthkdb
    I have drafted my proposal. Please review it and give your comments. https://docs.google.com/document/d/1m7_oBqkVMCM2w0ca66IV3dO5BdFlaXOoQxZsnZLweyw/edit?usp=sharing
    Chin Yeung
    @chinyeungli
    @srthkdb okay
    Sarthak
    @srthkdb
    @chinyeungli Thanks!
    Sarthak
    @srthkdb
    @chinyeungli Should I modify transform to not treat about_resource and name by default as required fields?
    Chin Yeung
    @chinyeungli
    @srthkdb Sorry, I don't get your point. Are you saying the input for transform needs to have the about_resource and name fields present, or otherwise the transform doesn't work?
    Sarthak
    @srthkdb
    @chinyeungli Not input, but the output JSON/CSV should have about_resource and name otherwise transform doesn't work
    Chin Yeung
    @chinyeungli

    @srthkdb Ah! I understand what you mean now. Well, the logic here for transform is to transform the data into what we want so it works as input for AboutCode TK, and therefore it'll check if the input contains the fields required by AboutCode TK. In other words, the transform will give an error if the essential fields are not present, because the transformed data will not work as input for AboutCode TK.

    Let me put it this way: users want to use the transform because they want to use AboutCode TK, so if the transformed data doesn't work, it may not make sense..

    but I guess you can argue about that.

    @srthkdb what do you think about this? Do you think we should get rid of this early check? Feel free to express your point or perhaps enter an issue ticket for others to discuss.

    Sarthak
    @srthkdb
    @chinyeungli I get this now. IMHO, we should keep this check for transform. This point came up earlier as well: whether we should generalize transform for all JSON/CSV, or design it specifically for JSON/CSV files from ScanCode TK and AboutCode TK only. We concluded to keep it limited to output from the ScanCode and AboutCode TKs. Hence, I feel we should keep this uniformity. If we still want transform to output files without essential fields, we can show a warning in the terminal that the output will not work as input for AboutCode TK.

    As @chinyeungli mentioned:
    "First of all, I think we need to spend some time on the objective/design.
    i.e. for the JSON input, do we take all JSON format file as input, OR should we be very strict to only include JSON files generated from AboutCode TK and ScanCode TK?"

    I think that, as part of the aboutcode package, it makes more sense to design our input strictly for JSON files generated from AboutCode TK and ScanCode TK, because:
    1) Easy to use: We can keep things simpler and faster for users. If we generalize this for all JSON inputs, we will need the user to input the parent fields for arguments; otherwise, the process of filtering and checking essential columns will become very complicated (as far as I can see). But if we design it particularly for scancode, for instance, we know that we need to look inside files. This will make the input on the user's side much easier, e.g., instead of writing something like

    column_filters:
      -for "files"
         -"type"

    User can just write

    column_filters:
     - "type"

    2) Include specific features aimed at these particular files: Generalizing will also remove built-in specific features such as checking for the "about_resource" and "name" fields and renaming "path" to "about_resource".

    3) Code efficiency and performance: Moreover, most of the time transform will be used only for JSON files generated from AboutCode TK and ScanCode TK; hence, by making it work strictly on these files, we can write more efficient code than in the generalized case.

    These were the points we discussed
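
A sketch of what the shorter `column_filters` form in point 1 relies on: the tool already knows to look inside the top-level "files" list of a ScanCode JSON scan, so the user only names the columns. The function name, the order of renaming before filtering, and the sample data are assumptions for illustration, not the actual transform code.

```python
def transform_files(scan, column_filters, renamings=None):
    """Rename, then keep only the listed columns, for each entry under "files"."""
    renamings = renamings or {}
    rows = []
    for entry in scan.get("files", []):
        # Apply renamings such as "path" -> "about_resource".
        renamed = {renamings.get(key, key): value for key, value in entry.items()}
        # Keep the requested columns that are present.
        rows.append({key: renamed[key] for key in column_filters if key in renamed})
    return rows


# Trimmed-down, made-up scan in the general shape of ScanCode JSON output.
scan = {
    "headers": [],
    "files": [
        {"path": "project/hello.c", "type": "file", "size": 120},
        {"path": "project/dir1", "type": "directory", "size": 0},
    ],
}
rows = transform_files(
    scan,
    column_filters=["about_resource", "type"],
    renamings={"path": "about_resource"},
)
```

A generalized version would need the user to say where the rows live (the `for "files"` level in the longer form), which is the extra input point 1 tries to avoid.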

    Sarthak
    @srthkdb
    I think there was a misunderstanding :D! Maybe you thought I was saying that the input of transform should have the essential fields, and that would have been a bug. I was talking about the output of transform, and I thought you were saying it would have been better if it did not perform those checks on the output. Now that I've understood, I personally feel that those warnings are helpful for the user because the transform command will be used only to produce input for AboutCode TK.
    Sarthak
    @srthkdb
    I've added this issue: nexB/aboutcode-toolkit#425 for the community to give their opinions.
    Chin Yeung
    @chinyeungli
    @srthkdb thanks for entering the ticket :thumbsup:
    Michael Herzog
    @mjherzog
    I am not sure what this means in terms of the technical design, but somewhere in the transform process we probably need a control file that specifies what the required fields are for a particular implementation. Different organizations could have different definitions of required fields beyond the few essential fields for AbC TK to work.
    Sarthak
    @srthkdb
    @mjherzog Yes, the transform currently supports required_fields, which the user can provide in the configuration file, and those additional fields will be treated as essential in addition to the default essential fields, namely about_resource and name.
    The doubt here was whether we should allow transform to treat about_resource and name as essential fields by default, without the user specifying anything.
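
A hedged sketch of the check being discussed: about_resource and name are treated as essential by default, and any user-supplied required_fields are added on top. `missing_fields` and the row format are hypothetical; the real transform code may structure this differently.

```python
ESSENTIAL_FIELDS = ("about_resource", "name")


def missing_fields(rows, required_fields=()):
    """Return {row index: [missing or empty field names]} for the given rows."""
    wanted = list(ESSENTIAL_FIELDS)
    wanted += [field for field in required_fields if field not in wanted]
    problems = {}
    for index, row in enumerate(rows):
        missing = [field for field in wanted if not row.get(field)]
        if missing:
            problems[index] = missing
    return problems


rows = [
    {"about_resource": "project/hello.c", "name": "hello", "version": "1.0"},
    {"about_resource": "project/dir1", "version": "2.0"},
]
problems = missing_fields(rows, required_fields=["version"])

# Warn instead of failing, as suggested earlier in the discussion.
for index, fields in problems.items():
    print(f"Warning: row {index} is missing {', '.join(fields)}; "
          "the output may not work as input for AboutCode TK.")
```

Whether the result should be an error or only a warning is exactly the open question in nexB/aboutcode-toolkit#425; the sketch just shows where that decision would sit.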
    Sarthak
    @srthkdb
    @chinyeungli any final feedback on my proposal will be really helpful! :D
    Chin Yeung
    @chinyeungli
    @srthkdb the proposal looks good.. :thumbsup:
    Philippe Ombredanne
    @pombredanne
    @/all If you are an aspiring GSoC student, please make sure your proposal is final ASAP before the deadline :)
    Sarthak
    @srthkdb
    @chinyeungli I would really appreciate your and the community's feedback on my GSOC candidature, so I can work on my shortcomings and perform better next time. Thanks :)
    Chin Yeung
    @chinyeungli
    @srthkdb Thanks again for your interest and contribution. As @pombredanne mentioned in the "discuss" chat, we have received over 100 applications, and we only have a limited number of open slots, so it's a tough call for us, and I can tell your proposal is one of the "several excellent proposals" that Philippe mentioned.
    Prabal Rawal
    @prabal-rawal
    Hey guys, I am new here and looking to contribute to open source.
    Is there any way I can help you guys?
    I know python to some extent and still learning.
    Philippe Ombredanne
    @pombredanne
    @prabal-rawal welcome here too :sun_with_face:
    the key thing is to find something that's interesting to you :) what is it about our projects that you like?
    Prabal Rawal
    @prabal-rawal
    @pombredanne I like how aboutcode toolkit makes it easier to keep information about the third-party software in the project.
    I have not yet used the app but the info I read in the docs and the website made me interested to join this project.
    I am still a novice who is learning so I don't think I can contribute much to this project but I will still try to do and learn something here.
    Philippe Ombredanne
    @pombredanne
    @prabal-rawal fair enough :) ... maybe there are some code tickets that can be of interest to you? @chinyeungli any pointer for a bite-size issue that would be a good start?
    Prabal Rawal
    @prabal-rawal
    @pombredanne what do code tickets mean?
    Philippe Ombredanne
    @pombredanne
    @prabal-rawal we use ticket or issue interchangeably for https://github.com/nexB/aboutcode-toolkit/issues ...
    e.g. an entry in some "tracker" of issues and todos :)
    Chin Yeung
    @chinyeungli
    @prabal-rawal welcome.. as @pombredanne mentioned, you can browse https://github.com/nexB/aboutcode-toolkit/issues and see if there is any issue ticket that you'd like to work on. Moreover, you can try to play with the aboutcode toolkit and see if there's any issue that you encounter or any suggestion/enhancement that you'd like aboutcode tk to have.