Want your own ML based app/site search?

SPTAG (Space Partition Tree And Graph) released

         

iamlost

5:49 pm on May 16, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Want to play with your very own app/site search?
Want to step into practical use of machine learning?

Microsoft has released Bing's SPTAG (Space Partition Tree And Graph) [github.com] under MIT licence.

A distributed approximate nearest neighborhood search (ANN) library which provides a high quality vector index build, search and distributed online serving toolkits for large scale vector search scenario.

...

This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 distances or cosine distances. Vectors returned for a query vector are the vectors that have smallest L2 distance or cosine distances with the query vector.

SPTAG provides two methods: kd-tree and relative neighborhood graph (SPTAG-KDT) and balanced k-means tree and relative neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index building cost, and SPTAG-BKT is advantageous in search accuracy in very high-dimensional data.

Note: written in C++ with Python wrapper.
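To make the quoted description concrete, here is a minimal sketch of exact (brute-force) nearest neighbor search under both distance measures. It is not SPTAG's API and uses none of its tree or graph indexing; the function names and toy vectors are made up for illustration, and it only shows what "smallest L2 distance or cosine distance" means.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, vectors, k=1, distance=l2_distance):
    """Exact k-nearest-neighbor search: scan every vector and sort by distance."""
    ranked = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return ranked[:k]

# Hypothetical toy data: three 2-d sample vectors and one query.
samples = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query = [0.9, 0.8]

l2_result = nearest(query, samples, k=2, distance=l2_distance)
cos_result = nearest(query, samples, k=2, distance=cosine_distance)
print(l2_result, cos_result)
```

The two measures can disagree: by L2 the closest sample is index 0, while by cosine distance it is index 2, which points in nearly the same direction as the query despite being far away. The full scan costs time proportional to the number of vectors per query, which is the cost that approximate methods such as SPTAG aim to avoid.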

NickMNS

9:14 pm on May 16, 2019 (gmt 0)




What makes this better than the many other packages out there, most notably scikit-learn in Python?
[scikit-learn.org...]

iamlost

3:19 am on May 17, 2019 (gmt 0)




@NickMNS: haven't a clue :)
I haven't used either the SPTAG mentioned in the OP or the scikit-learn you thought comparable. I just ran across it and thought it interesting. If this were a few years back, before I went to the time and trouble of reinventing the wheel (aka building my own ML-backed site search), I'd probably be neck deep in trialing it, but as is, not.

It's just nice that this sort of stuff is increasingly shared; it makes it so much easier to step in at the shallow end, as opposed to years ago when only the rather chilly deeps were available. Machine learning is not going away; to stay current, some exposure is becoming a requirement outside hobby sites.

NickMNS

5:21 pm on May 17, 2019 (gmt 0)




Ah I see now why the sudden interest.
[seroundtable.com...]

I find this very disingenuous on the part of Microsoft, as they are hyping this as a simple black-box solution that lets someone create a search engine able to find the height of the Eiffel Tower from a search for "how high is the tower in Paris". What they are providing is the algorithm needed to create such a search engine. Yes, there is value in that, but it is not sufficient to build anything nearly as sophisticated as they claim. The real "work" in this is determining the features on which to vectorize the terms. How would you vectorize "Eiffel Tower", "Paris", "France" and any and all related terms?

(Side note, maybe the searcher wanted the height of the "Tour de Montparnasse" and not "Tour Eiffel")
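A rough sketch of why the choice of features matters, assuming the most naive vectorization possible: word counts compared by cosine similarity. The two "documents" are invented for illustration, and this is not how Bing or SPTAG vectorizes anything.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Naive features: counts of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = bag_of_words("how high is the tower in Paris")

# Hypothetical documents: one relevant, one not.
doc_eiffel = bag_of_words("height of the Eiffel Tower")
doc_hotels = bag_of_words("the best cheap hotels in Paris")

sim_eiffel = cosine_similarity(query, doc_eiffel)
sim_hotels = cosine_similarity(query, doc_hotels)
print(round(sim_eiffel, 3), round(sim_hotels, 3))
```

With these features the hotel page scores higher than the Eiffel Tower page, because "the", "in" and "paris" match while "high" and "height" do not. A fast nearest neighbor search over such vectors would faithfully return the wrong answer; the search algorithm is only as good as the vectors fed into it.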

All this being said, the point of interest is that they appear to have developed a nearest-neighbor-like algorithm that is far more computationally efficient than the more common models out there, like the ones I referenced in scikit-learn.