dmlu
@dmlu:matrix.org
[m]
Hi! When I select "export to quickstatements", I get a blank page (no download occurs). Am I doing something wrong?
got it, I need to select that menu item when I have the data grid loaded
Antoine Beaubien
@antoine2711
Is there a link to get the latest nightly build of OR? Not the v3.4.1, but the current master branch from last night? I'm looking for a download link or a GH page link…
2 replies
Amisha Kumari
@Amishakumari544
Hi, I am new here. Can anyone help me get started contributing to this project?
5 replies
Shatakshi Gupta
@Shatakshi0805
Hi everyone. I have been trying to run the docs on my local machine with yarn start, following the README.md: git checkout master, edit and save the docs, then run yarn start. But after yarn start I can't see the site on localhost; I only see the OpenRefine logo. I believe this is because I am using Java 13.0.2. Can anyone suggest which lower Java version I should use to make this work? Thanks.
6 replies
Shatakshi Gupta
@Shatakshi0805
image.png
Jez (he/him)
@jez:petrichor.me
[m]
Hey folks 👋
What's the practical limit on size of dataset that can be imported into OR? I have a TSV file just short of 2GB, and after increasing the heap size to 8GB I managed to get it importing, but it was taking hours to import so I cancelled it.
The time remaining just increases (I left it for about 90 minutes before giving up last time)
Thad Guidry
@thadg:matrix.org
[m]
Jez (he/him): Our current architecture typically lets you work with approximately 1 million rows, but that depends entirely on the WIDTH or LENGTH of the Strings in the dataset, as well as on the number of columns, which increases the heap memory needed considerably. In the import options you can select only the columns you really need, and leave the conversion options unchecked to save memory (processing or transforming can always be done later, after the Project is fully created).
The red color basically says your memory heap is getting garbage collected and will be like that forever.
Can you limit the number of rows (the option Load at most X row(s) of data), or use our Split into columns later on, once you have a data grid? For example, try importing that TSV file as just Line-based text instead (creating only 1 long string per row); you can then Split into columns by the TAB character once the Project is loaded.
Also, don't check Attempt to parse cell text into numbers, which adds extra processing (heap memory).
Thad Guidry
@thadg:matrix.org
[m]
Basically, the more Cells OpenRefine has to create, the more heap memory is needed. Sometimes you don't need all the Columns or Rows, or you only need SOME of the data neatly put into Cells to work with. So the Line-based text importer is the most lightweight way to load a dataset and conserve memory. If the dataset still exceeds that... then you need to either increase your RAM and OpenRefine's -Xmx setting, or (with a backup of your data) try out the new architecture of OpenRefine in our 4.0 branch and build it. https://docs.openrefine.org/technical-reference/build-test-run It's super easy and we're here to help if you run into problems building it on your platform.
Jez (he/him)
@jez:petrichor.me
[m]
Great explanation, thanks Thad Guidry !
Thad Guidry
@thadg:matrix.org
[m]
I'd love to help you build our 4.0 branch and have you try it out on this large dataset!
Jez (he/him)
@jez:petrichor.me
[m]
I will experiment a bit with the various options you've suggested. It should be manageable one way or another as it's only 138,000 rows.
Thad Guidry
@thadg:matrix.org
[m]
but how WIDE are those rows :-)
and how many TABs per row? The more TABs per row, the more CELLs made per row in OpenRefine - MEMORY
Jez (he/him)
@jez:petrichor.me
[m]
Pretty wide... 😁
41 columns, quite a few of which are URLs
Thad Guidry
@thadg:matrix.org
[m]
Try just Line-based then (so that it's only 1 long cell per row) ...should be fine... then you can deal with splitting stuff into new columns.
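To put rough numbers on the point about cells, here is a back-of-the-envelope sketch using the figures mentioned in this conversation (138,000 rows, 41 columns). It only counts cells; it does not model OpenRefine's actual memory use, which also depends on how long the strings are.

```python
# Rough cell-count comparison; the figures come from the conversation above.
# This is plain arithmetic, not OpenRefine code.
rows = 138_000
columns = 41

cells_tabular = rows * columns   # TSV import: one cell per TAB-separated field
cells_line_based = rows * 1      # Line-based import: one long cell per row

print(cells_tabular)                       # 5658000
print(cells_line_based)                    # 138000
print(cells_tabular // cells_line_based)   # 41
```

So the line-based import creates 41 times fewer cells up front, and columns can then be extracted one at a time as needed.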
Can you upload the dataset here? is it public or private? If private, you can send to my email or upload somewhere privately to share with me and happy to take a look and see what might be the best option for you.
Jez (he/him)
@jez:petrichor.me
[m]
Oh yeah, importing it as line-based worked, took about 20s 🤣
Thad Guidry
@thadg:matrix.org
[m]
Did you set the Character encoding option at the import time? UTF8?
Jez (he/him)
@jez:petrichor.me
[m]
Yup
Thad Guidry
@thadg:matrix.org
[m]
k
Jez (he/him)
@jez:petrichor.me
[m]
Right, I need to put this on a back-burner for now, but will pop back when I have time to pick it up again if I have more questions.
Thad Guidry
@thadg:matrix.org
[m]
so use Add column based on this column
sure thing. Remember, if you use the Split into columns dialog... make sure to set the Split into X columns at most option.
Jez (he/him)
@jez:petrichor.me
[m]
Otherwise I just recreate the original problem 😀
Thad Guidry
@thadg:matrix.org
[m]
or better yet, just use Edit column -> Add column based on this column and use the value.split() function with a separator and a [0] index
right, otherwise repeat problem
rinse and repeat with value.split()[1], then [2], [3], [4], etc., increasing the index for each tab-separated string (or whatever separator you use) that you need extracted into a new column
value.split(/\t/)[6] <-- using regex and tab character
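For readers who want to see the logic outside OpenRefine, here is a small Python analogue of the two approaches discussed above: picking one field by index, and capping the number of pieces a split produces. It is a sketch, not GREL. The function names `nth_field` and `split_at_most` and the sample row are made up for illustration.

```python
import re
from typing import List, Optional


def nth_field(value: str, index: int, sep_pattern: str = r"\t") -> Optional[str]:
    """Split on a regex and return the field at `index`, or None if it is absent.

    Roughly the idea behind value.split(/\\t/)[6] above (hypothetical helper).
    """
    parts = re.split(sep_pattern, value)
    return parts[index] if 0 <= index < len(parts) else None


def split_at_most(value: str, max_columns: int, sep: str = "\t") -> List[str]:
    """Split into at most `max_columns` pieces; the last piece keeps the remainder.

    Roughly the idea behind "Split into X columns at most" (hypothetical helper).
    """
    return value.split(sep, max_columns - 1)


# Made-up sample row with 7 tab-separated fields.
row = "id42\thttps://example.org/a\tTitle\t2021\tmore\tstuff\tlast"

print(nth_field(row, 0))      # id42
print(nth_field(row, 6))      # last
print(nth_field(row, 7))      # None
print(split_at_most(row, 3))  # ['id42', 'https://example.org/a', 'Title\t2021\tmore\tstuff\tlast']
```

Extracting one index at a time only ever creates one new column per step, which is why it avoids recreating the original memory problem.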
Jez (he/him)
@jez:petrichor.me
[m]
Final q for now: what compression algorithms can OR cope with in an uploaded file? .zip, .gz and .bz2 are listed in the manual so I assume that's it?
Thad Guidry
@thadg:matrix.org
[m]
yeah those. I thought we had that in the beginning of our http://docs.openrefine.org
Jez (he/him)
@jez:petrichor.me
[m]
Yeah, it's there! I was just wondering if there was anything else (e.g. zstd) that is implemented but not documented yet.
It did not like my zstd file 🤣
Not urgent, but zstd as an option would be a nice addition, it strikes a really nice balance between speed and compression ratio from what I've seen
I guess it depends what's available in Java standard library vs adding Yet Another Dependency
Thad Guidry
@thadg:matrix.org
[m]
I don't think we have support for other archive file types. You'd have to look at our source code in the importers, and then go up from that level to see which library and parameters we allow. I think those are the only ones
Jez (he/him)
@jez:petrichor.me
[m]
Anyway, thanks!
But we could add support for anything that Apache Commons library supports https://commons.apache.org/proper/commons-compress/examples.html
Thad Guidry
@thadg:matrix.org
[m]
I've added a feature request: OpenRefine/OpenRefine#4058
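Until something like that exists, one way to avoid surprises is to check what compression a file actually uses before uploading it. Below is a minimal sketch that sniffs the well-known magic bytes of gzip, bzip2, zip and zstd. The function names are made up for illustration. Which formats OpenRefine accepts is taken from this conversation (.zip, .gz, .bz2), not from its source code.

```python
import bz2
import gzip

# Leading "magic" bytes of each container format.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"PK\x03\x04": "zip",          # non-empty zip archive
    b"\x28\xb5\x2f\xfd": "zstd",   # zstd frame magic number
}


def sniff_compression(header: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 4 bytes of `path` and guess its compression format."""
    with open(path, "rb") as f:
        return sniff_compression(f.read(4))


print(sniff_compression(gzip.compress(b"a\tb\n")))  # gzip
print(sniff_compression(bz2.compress(b"a\tb\n")))   # bzip2
print(sniff_compression(b"a\tb\n"))                 # unknown
```

A file reported as zstd would need recompressing to one of the listed formats before upload.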
Anthony Del Rosario
@ADelRosarioH
Hi everyone! I recently developed a .NET client for OpenRefine (https://github.com/ADelRosarioH/OpenRefine.Net). I would like to submit a PR adding it to the download page of OpenRefine.org, but I can't find any contribution guidelines in the website repo. Should I just create a PR?
2 replies
bhargavii
@BhargaviChada

Hello everyone,

I'm new to this project, so I was looking into issues with the good first issue tag. I came across the issue "#4007 SQL importer web UI should return the DatabaseServiceException message instead of generic error".
When I tried replicating the error, the message in the web UI wasn't the generic "error:Bad Request"; instead it showed the corresponding SQL query followed by "command denied to user".
So I believe this issue no longer occurs.
If it does persist, could you tell me which folders I should look into in order to resolve it?

OpenRefine/OpenRefine#4007

1 reply
dave0529
@dave0529
Hello,
I am a junior data engineer in South Korea.
I am having some problems installing the OpenRefine CKAN Storage Extension on OpenRefine version 3.1.
I tried to solve the problem using https://github.com/OpenGov-OpenData/openrefine-ckan-storage-extension, but it didn't work.
Could you tell me how to configure it?
2 replies