# Compact Language Detector v3 (CLD3)

* [Model](#model)
* [Supported Languages](#supported-languages)
* [Installation](#installation)
* [Bugs and Feature Requests](#bugs-and-feature-requests)
* [Announcements and Discussion](#announcements-and-discussion)
* [Credits](#credits)

### Model

CLD3 is a neural network model for language identification. This package
contains the inference code and a trained model. The inference code
extracts character ngrams from the input text and computes the fraction
of times each of them appears. For example, as shown in the figure below,
if the input text is "banana", then one of the extracted trigrams is "ana"
and the corresponding fraction is 2/4, because "ana" accounts for two of the
four trigrams "ban", "ana", "nan", "ana". Each ngram is hashed down to an id
within a small range, and each id is represented by a dense embedding vector
estimated during training.

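
The feature extraction described above can be sketched in a few lines of Python. This is only an illustration, not the package's inference code: the function names `ngram_fractions` and `hashed_id`, the use of CRC32 as the hash, and the bucket count are all assumptions made for the example.

```python
import zlib
from collections import Counter
from typing import Dict


def ngram_fractions(text: str, n: int) -> Dict[str, float]:
    """Map each character n-gram of `text` to its relative frequency."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


def hashed_id(gram: str, num_buckets: int) -> int:
    """Map an n-gram to an id in [0, num_buckets).

    CRC32 is a stand-in here; the hash used by CLD3 itself is not
    described in this README.
    """
    return zlib.crc32(gram.encode("utf-8")) % num_buckets


fractions = ngram_fractions("banana", 3)
print(fractions["ana"])  # 0.5, i.e. 2/4

# Hypothetical bucket count, chosen only for the example.
ids = {gram: hashed_id(gram, 1000) for gram in fractions}
```

Hashing keeps the embedding table small at the cost of occasional collisions, where two different ngrams share one id and therefore one embedding vector.
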
The model averages the embeddings corresponding to each ngram type according
to the fractions, and the averaged embeddings are concatenated to produce
the embedding layer. The remaining components of the network are a hidden
(rectified linear) layer and a softmax layer.

To get a language prediction for the input text, we simply perform a forward
pass through the network.

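
As a rough illustration of the forward pass described in this section, the sketch below uses plain Python lists and random weights. The layer sizes, the function names, and the `features`/`tables` layout are assumptions made for the example; the trained model shipped with this package has its own dimensions and parameters.

```python
import math
import random


def average_embedding(fractions, table):
    """Average the embedding rows, weighted by the n-gram fractions."""
    out = [0.0] * len(table[0])
    for idx, frac in fractions.items():
        for d, value in enumerate(table[idx]):
            out[d] += frac * value
    return out


def dense(x, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, tables, hidden_w, hidden_b, out_w, out_b):
    # One averaged embedding per n-gram type, concatenated into one vector.
    embedding = []
    for fractions, table in zip(features, tables):
        embedding.extend(average_embedding(fractions, table))
    hidden = relu(dense(embedding, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 2 n-gram types, 10 hashed ids each, embedding dim 4,
# 6 hidden units, 3 output languages.
tables = [rand_matrix(10, 4), rand_matrix(10, 4)]
features = [{3: 0.5, 7: 0.5}, {1: 0.25, 2: 0.75}]  # hashed id -> fraction
probs = forward(features, tables,
                rand_matrix(6, 8), [0.0] * 6,
                rand_matrix(3, 6), [0.0] * 3)
print(probs)  # one probability per output language
```

The predicted language is the output with the highest probability.
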
### Supported Languages

The model outputs BCP-47-style language codes, shown in the table below. For
some languages, output is differentiated by script. Language and script names
are taken from
[Unicode CLDR](https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en).

Output Code | Language Name   | Script Name
----------- | --------------- | ------------------------------------------
af          | Afrikaans       | Latin
am          | Amharic         | Ethiopic
ar          | Arabic          | Arabic
bg          | Bulgarian       | Cyrillic
bg-Latn     | Bulgarian       | Latin
bn          | Bangla          | Bangla
bs          | Bosnian         | Latin
ca          | Catalan         | Latin
ceb         | Cebuano         | Latin
co          | Corsican        | Latin
cs          | Czech           | Latin
cy          | Welsh           | Latin
da          | Danish          | Latin
de          | German          | Latin
el          | Greek           | Greek
el-Latn     | Greek           | Latin
en          | English         | Latin
eo          | Esperanto       | Latin
es          | Spanish         | Latin
et          | Estonian        | Latin
eu          | Basque          | Latin
fa          | Persian         | Arabic
fi          | Finnish         | Latin
fil         | Filipino        | Latin
fr          | French          | Latin
fy          | Western Frisian | Latin
ga          | Irish           | Latin
gd          | Scottish Gaelic | Latin
gl          | Galician        | Latin
gu          | Gujarati        | Gujarati
ha          | Hausa           | Latin
haw         | Hawaiian        | Latin
hi          | Hindi           | Devanagari
hi-Latn     | Hindi           | Latin
hmn         | Hmong           | Latin
hr          | Croatian        | Latin
ht          | Haitian Creole  | Latin
hu          | Hungarian       | Latin
hy          | Armenian        | Armenian
id          | Indonesian      | Latin
ig          | Igbo            | Latin
is          | Icelandic       | Latin
it          | Italian         | Latin
iw          | Hebrew          | Hebrew
ja          | Japanese        | Japanese
ja-Latn     | Japanese        | Latin
jv          | Javanese        | Latin
ka          | Georgian        | Georgian
kk          | Kazakh          | Cyrillic
km          | Khmer           | Khmer
kn          | Kannada         | Kannada
ko          | Korean          | Korean
ku          | Kurdish         | Latin
ky          | Kyrgyz          | Cyrillic
la          | Latin           | Latin
lb          | Luxembourgish   | Latin
lo          | Lao             | Lao
lt          | Lithuanian      | Latin
lv          | Latvian         | Latin
mg          | Malagasy        | Latin
mi          | Maori           | Latin
mk          | Macedonian      | Cyrillic
ml          | Malayalam       | Malayalam
mn          | Mongolian       | Cyrillic
mr          | Marathi         | Devanagari
ms          | Malay           | Latin
mt          | Maltese         | Latin
my          | Burmese         | Myanmar
ne          | Nepali          | Devanagari
nl          | Dutch           | Latin
no          | Norwegian       | Latin
ny          | Nyanja          | Latin
pa          | Punjabi         | Gurmukhi
pl          | Polish          | Latin
ps          | Pashto          | Arabic
pt          | Portuguese      | Latin
ro          | Romanian        | Latin
ru          | Russian         | Cyrillic
ru-Latn     | Russian         | Latin
sd          | Sindhi          | Arabic
si          | Sinhala         | Sinhala
sk          | Slovak          | Latin
sl          | Slovenian       | Latin
sm          | Samoan          | Latin
sn          | Shona           | Latin
so          | Somali          | Latin
sq          | Albanian        | Latin
sr          | Serbian         | Cyrillic
st          | Southern Sotho  | Latin
su          | Sundanese       | Latin
sv          | Swedish         | Latin
sw          | Swahili         | Latin
ta          | Tamil           | Tamil
te          | Telugu          | Telugu
tg          | Tajik           | Cyrillic
th          | Thai            | Thai
tr          | Turkish         | Latin
uk          | Ukrainian       | Cyrillic
ur          | Urdu            | Arabic
uz          | Uzbek           | Latin
vi          | Vietnamese      | Latin
xh          | Xhosa           | Latin
yi          | Yiddish         | Hebrew
yo          | Yoruba          | Latin
zh          | Chinese         | Han (including Simplified and Traditional)
zh-Latn     | Chinese         | Latin
zu          | Zulu            | Latin

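
Callers that need the language and the script separately can split the output code at the hyphen. The helper below is a hypothetical convenience, not part of this package, and it only handles the two shapes that appear in the table (`xx` and `xx-Latn`), not full BCP-47 tags.

```python
from typing import NamedTuple, Optional


class OutputCode(NamedTuple):
    language: str
    script: Optional[str]  # None when the code carries no script subtag


def split_output_code(code: str) -> OutputCode:
    """Split a code such as "bg-Latn" into language and script parts."""
    language, _, script = code.partition("-")
    return OutputCode(language, script or None)


print(split_output_code("bg-Latn"))  # OutputCode(language='bg', script='Latn')
print(split_output_code("bg"))       # OutputCode(language='bg', script=None)
```

When `script` is `None`, the script is the one listed for the plain code in the table (for example, Cyrillic for `bg`).
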
### Installation

CLD3 is designed to run in the Chrome browser, so it relies on code in
[Chromium](http://www.chromium.org/).
The steps for building and running the demo of the language detection model are:

- [Check out](http://www.chromium.org/developers/how-tos/get-the-code) the
  Chromium repository.
- Copy the code to `//third_party/cld_3`.
- Uncomment the `language_identifier_main` executable in `src/BUILD.gn`.
- Build and run the model using the commands:

```shell
gn gen out/Default
ninja -C out/Default third_party/cld_3/src/src:language_identifier_main
out/Default/language_identifier_main
```

### Bugs and Feature Requests

Open a [GitHub issue](https://github.com/google/cld3/issues) for this repository to file bugs and feature requests.

### Announcements and Discussion

For announcements of major updates and for general discussion, please subscribe to the mailing list:
[cld3-users@googlegroups.com](https://groups.google.com/forum/#!forum/cld3-users)

### Credits

Original authors of the code in this package include (in alphabetical order):

* Alex Salcianu
* Andy Golding
* Anton Bakalov
* Chris Alberti
* Daniel Andor
* David Weiss
* Emily Pitler
* Greg Coppola
* Jason Riesa
* Kuzman Ganchev
* Michael Ringgaard
* Nan Hua
* Ryan McDonald
* Slav Petrov
* Stefan Istrate
* Terry Koo