Fergus McDowall
@fergiemcdowall
Hi @jeffsee55, and thanks for reaching out. You have found a bug:
basically, favoriteposts finds hits, but favoritePosts (capital 'P') doesn't.
If you initialize search-index with caseSensitive: true, then the field name favoritePosts works as expected.
Jeff See
@jeffsee55
That's one thing I noticed, and it was easy to fix by just casting them to lowercase. The more surprising thing for me was the difference between an array value and a non-array value.
Though I noticed this comment: https://github.com/fergiemcdowall/search-index/issues/540#issuecomment-822635413. So it seems like this is expected. Do array values not get indexed the same way?
if you initialize search-index with caseSensitive: true then the field name favoritePosts works as expected
Thanks, I'll try that
Fergus McDowall
@fergiemcdowall
ah- yes sorry- I was a bit quick to reply there...
here is an example:
;(async () => {
  const si = require('search-index')
  const print = txt => console.log(JSON.stringify(txt, null, 2))
  const db = await si({
    name: 'arrays'
  })

  const data = [
    {
      name: 'Homer Simpson',
      favoritePosts: ['content/posts/welcome3.md'],
      _id: 'content/authors/homer.md'
    }
  ]

  await db.PUT(data)

  await db
    .QUERY({
      FIELD: ['favoriteposts'],
      VALUE: 'content/posts/welcome3.md'
    })
    .then(print)
})()
just to explain ->
when you use arrays, special characters are not stripped
Jeff See
@jeffsee55
Ok, makes sense why that works for me then, what's the reason for the difference?
Fergus McDowall
@fergiemcdowall
so ['content/posts/welcome3.md'] is stored in the index as 'content/posts/welcome3.md'
whereas if you simply do 'content/posts/welcome3.md' (no array), then special chars will be stripped and the string will be tokenized
Jeff See
@jeffsee55

Would

  const data = [
    {
      name: 'Homer Simpson',
      favoritePosts: ['content/posts/welcome3.md', 'content/posts/welcome4.md'],
      _id: 'content/authors/homer.md'
    }
  ]

be indexed as 'content/posts/welcome3.md' and 'content/posts/welcome4.md' as separate fields somehow? (Sorry, might be a confusing question.)

Fergus McDowall
@fergiemcdowall
yes
If you want to investigate this you can use level-out to inspect the index
level-out arrays
{"key":"favoriteposts:content/posts/welcome3.md#1.00","value":["content/authors/homer.md"]}
{"key":"name:homer#1.00","value":["content/authors/homer.md"]}
{"key":"name:simpson#1.00","value":["content/authors/homer.md"]}
{"key":"○DOCUMENT_COUNT○","value":1}
{"key":"○DOC_RAW○content/authors/homer.md○","value":{"name":"Homer Simpson","favoritePosts":["content/posts/welcome3.md"],"_id":"content/authors/homer.md"}}
{"key":"○FIELD○favoriteposts○","value":"favoriteposts"}
{"key":"○FIELD○name○","value":"name"}
{"key":"○○CREATED","value":1622729161588}
Jeff See
@jeffsee55
Ah, that would be awesome, I was looking for a way to do that
Fergus McDowall
@fergiemcdowall
level-out is really handy for seeing how things are being indexed
BTW- indexing and query pipelines are going to be much better in search-index@3 (coming soon)
Jeff See
@jeffsee55
I saw that! Seems like it's close. Is there a specific reason array values don't get stripped of special characters?
Fergus McDowall
@fergiemcdowall
Yes, it's so that you can easily do your own tokenization and use non-ASCII chars
So for instance, I work a lot with Scandinavian languages, and sentences like "bøker er gøy" are not always tokenized correctly
(due to the non-ASCII ø)
Jeff See
@jeffsee55
Ok, so if I had an item like:
  const data = [
    {
      name: 'Homer Simpson',
      comments: ['doh', 'this is another comment'],
      _id: 'content/authors/homer.md'
    }
  ]
Fergus McDowall
@fergiemcdowall
therefore it's easier to do ['bøker', 'er', 'gøy']
You could also introduce ngrams: ['bøker er', 'er gøy']
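A minimal sketch of the manual tokenization described above, assuming you split on whitespace yourself and pass the resulting array as the field value. `wordNgrams` is a hypothetical helper, not part of search-index.

```javascript
// Hypothetical helper (not part of search-index): split on whitespace only,
// so non-ASCII letters such as "ø" stay intact, then build word n-grams.
function wordNgrams(text, n) {
  const words = text.trim().split(/\s+/).filter(Boolean)
  const grams = []
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(' '))
  }
  return grams
}

const sentence = 'bøker er gøy'
// Single words plus two-word n-grams, ready to be used as an array value
const tokens = [...wordNgrams(sentence, 1), ...wordNgrams(sentence, 2)]
console.log(tokens)
```

The resulting array could then be used as a field value in the document passed to PUT, in place of the raw sentence.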
Jeff See
@jeffsee55

Would I get a hit for:

  await db
    .QUERY({
      FIELD: ['comments'],
      VALUE: 'another'
    })
    .then(print)

Or does that need to be indexed manually by me

Fergus McDowall
@fergiemcdowall
etc.
lets see...
no
Jeff See
@jeffsee55
Ok that's what I'm seeing too I just wasn't sure if I was doing something wrong
Fergus McDowall
@fergiemcdowall
 const data = [
    {
      name: 'Homer Simpson',
      comments: 'doh this is another comment',
      _id: 'content/authors/homer.md'
    }
  ]
This would tokenize comments to 'doh', 'this', 'is', 'another', 'comment'
and you would then get a hit for VALUE: 'another'
comments: ['doh', 'this is another comment'] would allow you to search for VALUE: 'doh' and VALUE: 'this is another comment'
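If both whole-phrase and single-word lookups are wanted from an array field, one possible approach is to add the individual words to the array yourself before indexing. This is a sketch under that assumption; `phrasesAndWords` is a hypothetical helper, and it only splits on whitespace, so punctuation stays attached to words.

```javascript
// Hypothetical helper (not part of search-index): keep every phrase verbatim
// and also add each of its whitespace-separated words, without duplicates.
function phrasesAndWords(phrases) {
  const values = new Set()
  for (const phrase of phrases) {
    values.add(phrase)
    for (const word of phrase.split(/\s+/).filter(Boolean)) {
      values.add(word)
    }
  }
  return [...values]
}

const data = [
  {
    name: 'Homer Simpson',
    comments: phrasesAndWords(['doh', 'this is another comment']),
    _id: 'content/authors/homer.md'
  }
]
console.log(data[0].comments)
```

Going by the explanation above, each array entry would be stored as-is, so both VALUE: 'another' and VALUE: 'this is another comment' should then have a matching entry.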
Jeff See
@jeffsee55
Ok, thanks for the explanation. I'm trying to see if I can use this for full search of potentially complex objects. Some might have an array of objects - which I'd like to sort of "flatten" out into something that gets indexed in the normal way
But maybe what I'm starting to understand is that there's no reason to keep those values as arrays, I can just merge them as you've done in the last example into a single string
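A rough sketch of that flattening idea, assuming the nested values are plain strings and numbers. `collectText` and the shape of the `comments` objects are hypothetical, made up for illustration.

```javascript
// Hypothetical helper: walk a nested value and collect every string or
// number leaf, in order, so the result can be joined into one string.
function collectText(value, out = []) {
  if (typeof value === 'string' || typeof value === 'number') {
    out.push(String(value))
  } else if (Array.isArray(value)) {
    for (const item of value) collectText(item, out)
  } else if (value !== null && typeof value === 'object') {
    for (const item of Object.values(value)) collectText(item, out)
  }
  return out
}

// Illustrative document with an array of objects (shape is an assumption)
const author = {
  _id: 'content/authors/homer.md',
  name: 'Homer Simpson',
  comments: [
    { text: 'doh', post: 'welcome3' },
    { text: 'this is another comment', post: 'welcome4' }
  ]
}

// Replace the array of objects with a single string before indexing
const flattened = {
  _id: author._id,
  name: author.name,
  comments: collectText(author.comments).join(' ')
}
console.log(flattened.comments)
```

Because `comments` is now a plain string, it would go through the normal tokenization described earlier instead of being stored verbatim.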
Fergus McDowall
@fergiemcdowall
Yes, for American English, that is probably the easiest
If you want to preserve special chars and punctuation, then arrays are the way to go
Jeff See
@jeffsee55

to preserve special chars and punctuation, then arrays are the way to go

But using arrays will also remove the ability to search for partial matches, correct? And to support those I'd have to use ngrams

Fergus McDowall
@fergiemcdowall
yes, you could also look into DICTIONARY
Nice to talk to you Jeff! I need to pick up my kid from kindergarten now, but I can get back to you later if you have any more questions
Jeff See
@jeffsee55
Thanks @fergiemcdowall for the help, I'll see if I can learn more about DICTIONARY
brightinnovator
@brightinnovator
I have a question on javascript. can someone help me on the javascript issue?
brightinnovator
@brightinnovator

I want to read 30 million (3 crore) rows from a 2 GB CSV file and insert them into MySQL via Java.

Could someone please help me find the fastest and most memory-efficient way to do this, avoiding out-of-memory exceptions and keeping the load time short?

Please kindly advise.

goodev2021
@goodev2021
How can I develop a full-text search application using search-index with WatermelonDB (https://watermelondb.now.sh/)?
Fergus McDowall
@fergiemcdowall
You would need to find a leveldown that works for watermelondb