init
Telegram/ThirdParty/cld3/README.md
@@ -0,0 +1,191 @@
# Compact Language Detector v3 (CLD3)

* [Model](#model)
* [Supported Languages](#supported-languages)
* [Installation](#installation)
* [Bugs and Feature Requests](#bugs-and-feature-requests)
* [Announcements and Discussion](#announcements-and-discussion)
* [Credits](#credits)

### Model

CLD3 is a neural network model for language identification. This package
contains the inference code and a trained model. The inference code
extracts character ngrams from the input text and computes the fraction
of times each of them appears. For example, as shown in the figure below,
if the input text is "banana", then one of the extracted trigrams is "ana"
and the corresponding fraction is 2/4. The ngrams are hashed down to an id
within a small range, and each id is represented by a dense embedding vector
estimated during training.

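The counting and hashing steps described above can be sketched in a few lines
of Python. This is only an illustration, not the package's inference code:
`ngram_fractions`, `ngram_id`, and the bucket count of 1000 are hypothetical
names and values, and `zlib.crc32` stands in for whatever hash function the
library actually uses.

```python
from collections import Counter
from fractions import Fraction
import zlib


def ngram_fractions(text, n):
    """Map each character ngram of length n in `text` to its relative frequency."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: Fraction(count, total) for gram, count in counts.items()}


def ngram_id(gram, num_buckets):
    """Hash an ngram to an id in [0, num_buckets).

    zlib.crc32 is a stand-in chosen for illustration; it is an assumption,
    not the hash function CLD3 uses.
    """
    return zlib.crc32(gram.encode("utf-8")) % num_buckets


# "banana" yields the trigrams ban, ana, nan, ana: four in total.
fractions = ngram_fractions("banana", 3)
print(fractions["ana"])  # 1/2, i.e. 2 of the 4 trigrams

# Hypothetical bucket count; each id would index a row of an embedding table.
ids = {gram: ngram_id(gram, 1000) for gram in fractions}
```
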
The model averages the embeddings corresponding to each ngram type according
to the fractions, and the averaged embeddings are concatenated to produce
the embedding layer. The remaining components of the network are a hidden
(rectified linear) layer and a softmax layer.

To get a language prediction for the input text, we simply perform a forward
pass through the network.

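As a rough sketch of that forward pass, the pure-Python example below averages
embeddings by fraction, concatenates them, and applies a rectified linear layer
followed by a softmax. Everything specific in it is an assumption made for
illustration: the layer sizes, the three-language label set, the use of two
ngram types, and the random weights, which stand in for the trained model
shipped with the package.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible

NUM_BUCKETS = 16                # hypothetical hash range per ngram type
EMBED_DIM = 4                   # hypothetical embedding width
HIDDEN_DIM = 8                  # hypothetical hidden layer width
LANGUAGES = ["en", "fr", "de"]  # hypothetical label set


def rand_matrix(rows, cols):
    """Random placeholder weights; a real model would load trained values."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def averaged_embedding(id_fractions, table):
    """Average the embedding rows of the given ids, weighted by fraction."""
    out = [0.0] * len(table[0])
    for ngram_id, fraction in id_fractions.items():
        for j, value in enumerate(table[ngram_id]):
            out[j] += fraction * value
    return out


def dense(vector, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, tables, hidden_w, hidden_b, out_w, out_b):
    # Embedding layer: concatenate one averaged embedding per ngram type.
    embedding = []
    for id_fractions, table in zip(features, tables):
        embedding.extend(averaged_embedding(id_fractions, table))
    hidden = [max(0.0, v) for v in dense(embedding, hidden_w, hidden_b)]  # ReLU
    return softmax(dense(hidden, out_w, out_b))


# Two ngram types (say bigrams and trigrams), each mapping id -> fraction.
features = [{3: 0.6, 7: 0.4}, {1: 0.5, 5: 0.25, 9: 0.25}]
tables = [rand_matrix(NUM_BUCKETS, EMBED_DIM) for _ in features]
hidden_w = rand_matrix(HIDDEN_DIM, EMBED_DIM * len(features))
hidden_b = [0.0] * HIDDEN_DIM
out_w = rand_matrix(len(LANGUAGES), HIDDEN_DIM)
out_b = [0.0] * len(LANGUAGES)

probs = predict(features, tables, hidden_w, hidden_b, out_w, out_b)
best = LANGUAGES[probs.index(max(probs))]
print(best, probs)
```

Because the weights here are random, the printed language is meaningless; only
the shape of the computation is meant to carry over.
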
### Supported Languages

The model outputs BCP-47-style language codes, shown in the table below. For
some languages, output is differentiated by script. Language and script names
are taken from
[Unicode CLDR](https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en).

Output Code | Language Name   | Script Name
----------- | --------------- | ------------------------------------------
af          | Afrikaans       | Latin
am          | Amharic         | Ethiopic
ar          | Arabic          | Arabic
bg          | Bulgarian       | Cyrillic
bg-Latn     | Bulgarian       | Latin
bn          | Bangla          | Bangla
bs          | Bosnian         | Latin
ca          | Catalan         | Latin
ceb         | Cebuano         | Latin
co          | Corsican        | Latin
cs          | Czech           | Latin
cy          | Welsh           | Latin
da          | Danish          | Latin
de          | German          | Latin
el          | Greek           | Greek
el-Latn     | Greek           | Latin
en          | English         | Latin
eo          | Esperanto       | Latin
es          | Spanish         | Latin
et          | Estonian        | Latin
eu          | Basque          | Latin
fa          | Persian         | Arabic
fi          | Finnish         | Latin
fil         | Filipino        | Latin
fr          | French          | Latin
fy          | Western Frisian | Latin
ga          | Irish           | Latin
gd          | Scottish Gaelic | Latin
gl          | Galician        | Latin
gu          | Gujarati        | Gujarati
ha          | Hausa           | Latin
haw         | Hawaiian        | Latin
hi          | Hindi           | Devanagari
hi-Latn     | Hindi           | Latin
hmn         | Hmong           | Latin
hr          | Croatian        | Latin
ht          | Haitian Creole  | Latin
hu          | Hungarian       | Latin
hy          | Armenian        | Armenian
id          | Indonesian      | Latin
ig          | Igbo            | Latin
is          | Icelandic       | Latin
it          | Italian         | Latin
iw          | Hebrew          | Hebrew
ja          | Japanese        | Japanese
ja-Latn     | Japanese        | Latin
jv          | Javanese        | Latin
ka          | Georgian        | Georgian
kk          | Kazakh          | Cyrillic
km          | Khmer           | Khmer
kn          | Kannada         | Kannada
ko          | Korean          | Korean
ku          | Kurdish         | Latin
ky          | Kyrgyz          | Cyrillic
la          | Latin           | Latin
lb          | Luxembourgish   | Latin
lo          | Lao             | Lao
lt          | Lithuanian      | Latin
lv          | Latvian         | Latin
mg          | Malagasy        | Latin
mi          | Maori           | Latin
mk          | Macedonian      | Cyrillic
ml          | Malayalam       | Malayalam
mn          | Mongolian       | Cyrillic
mr          | Marathi         | Devanagari
ms          | Malay           | Latin
mt          | Maltese         | Latin
my          | Burmese         | Myanmar
ne          | Nepali          | Devanagari
nl          | Dutch           | Latin
no          | Norwegian       | Latin
ny          | Nyanja          | Latin
pa          | Punjabi         | Gurmukhi
pl          | Polish          | Latin
ps          | Pashto          | Arabic
pt          | Portuguese      | Latin
ro          | Romanian        | Latin
ru          | Russian         | Cyrillic
ru-Latn     | Russian         | Latin
sd          | Sindhi          | Arabic
si          | Sinhala         | Sinhala
sk          | Slovak          | Latin
sl          | Slovenian       | Latin
sm          | Samoan          | Latin
sn          | Shona           | Latin
so          | Somali          | Latin
sq          | Albanian        | Latin
sr          | Serbian         | Cyrillic
st          | Southern Sotho  | Latin
su          | Sundanese       | Latin
sv          | Swedish         | Latin
sw          | Swahili         | Latin
ta          | Tamil           | Tamil
te          | Telugu          | Telugu
tg          | Tajik           | Cyrillic
th          | Thai            | Thai
tr          | Turkish         | Latin
uk          | Ukrainian       | Cyrillic
ur          | Urdu            | Arabic
uz          | Uzbek           | Latin
vi          | Vietnamese      | Latin
xh          | Xhosa           | Latin
yi          | Yiddish         | Hebrew
yo          | Yoruba          | Latin
zh          | Chinese         | Han (including Simplified and Traditional)
zh-Latn     | Chinese         | Latin
zu          | Zulu            | Latin

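Some output codes in the table carry a script suffix (for example `bg-Latn`),
while others are a bare language code. A caller that wants to group results by
language might split the two parts as sketched below. `split_output_code` is a
hypothetical helper written for this example and is not part of the package.

```python
def split_output_code(code):
    """Split an output code such as "bg-Latn" into (language, script).

    Codes without a script suffix, such as "bg", return None for the script;
    the table above lists the script each bare code corresponds to.
    """
    language, _, script = code.partition("-")
    return language, script or None


print(split_output_code("bg-Latn"))  # ('bg', 'Latn')
print(split_output_code("zh"))       # ('zh', None)
```
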
### Installation
CLD3 is designed to run in the Chrome browser, so it relies on code in
[Chromium](http://www.chromium.org/).
The steps for building and running the demo of the language detection model are:

- [Check out](http://www.chromium.org/developers/how-tos/get-the-code) the
  Chromium repository.
- Copy the code to `//third_party/cld_3`.
- Uncomment the `language_identifier_main` executable in `src/BUILD.gn`.
- Build and run the model using the commands:

```shell
gn gen out/Default
ninja -C out/Default third_party/cld_3/src/src:language_identifier_main
out/Default/language_identifier_main
```

### Bugs and Feature Requests

Open a [GitHub issue](https://github.com/google/cld3/issues) for this repository to file bugs and feature requests.

### Announcements and Discussion

For announcements regarding major updates, as well as general discussion, please subscribe to:
[cld3-users@googlegroups.com](https://groups.google.com/forum/#!forum/cld3-users)

### Credits

Original authors of the code in this package include (in alphabetical order):

* Alex Salcianu
* Andy Golding
* Anton Bakalov
* Chris Alberti
* Daniel Andor
* David Weiss
* Emily Pitler
* Greg Coppola
* Jason Riesa
* Kuzman Ganchev
* Michael Ringgaard
* Nan Hua
* Ryan McDonald
* Slav Petrov
* Stefan Istrate
* Terry Koo