russianmorphology/README.md
2016-08-05 17:21:31 -04:00

71 lines
2.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Russian Morphology for lucene
Russian and English morphology for java and lucene 6.1 framework based on open source dictionary from site [АОТ](http://aot.ru). It use dictionary base morphology with some heuristics for unknown words. It support homonym for example for Russian word "вина" it gives two variants "вино" and "вина".
### How to use
First download
[morph-1.2.jar](https://bintray.com/artifact/download/akuznetsov/russianmorphology/org/apache/lucene/morphology/morph/1.2/morph-1.2.jar)
and add it to your class path. When download [Russian](https://bintray.com/artifact/download/akuznetsov/russianmorphology/org/apache/lucene/morphology/russian/1.2/russian-1.2.jar) or
[English](https://bintray.com/artifact/download/akuznetsov/russianmorphology/org/apache/lucene/morphology/english/1.2/english-1.2.jar) package.
If you use maven you can add dependency
<dependency>
<groupId>org.apache.lucene.morphology</groupId>
<artifactId>russian</artifactId>
<version>1.2</version>
</dependency>
<dependency>
<groupId>org.apache.lucene.morphology</groupId>
<artifactId>english</artifactId>
<version>1.2</version>
</dependency>
Don't forget add link to repository
<repositories>
...............
<repository>
<snapshots>
<enabled>false</enabled>
</snapshots>
<id>bintray-akuznetsov-russianmorphology</id>
<name>bintray</name>
<url>http://dl.bintray.com/akuznetsov/russianmorphology</url>
</repository>
</repositories>
Now you can create a Lucene Analyzer
RussianAnalayzer russian = new RussianAnalayzer();
EnglishAnalayzer english = new EnglishAnalayzer();
You can write you own analyzer using filter that convert word in it's right forms.
LuceneMorphology luceneMorph = new EnglishLuceneMorphology();
TokenStream tokenStream = new MorphlogyFilter(result, luceneMorph);
Because usually LuceneMorphology contains a lot data needing for it functionality, it is better didn't create this object for each MorphologyFilter.
Also if you need get a list of base forms of word, you can use following example
LuceneMorphology luceneMorph = new EnglishLuceneMorphology();
List<String> wordBaseForms = luceneMorph.getMorphInfo(word);
### Restrictions
* It works only with UTF-8.
* It assume what letters е and ё are the same.
* Word forms with prefixes like "наибольший" treated as separate word.