init

2026-02-16 15:50:16 +03:00
commit afb81b8278
13816 changed files with 3689732 additions and 0 deletions
--- a/Telegram/ThirdParty/cld3/README.md
+++ b/Telegram/ThirdParty/cld3/README.md
@@ -0,0 +1,191 @@
+# Compact Language Detector v3 (CLD3)
+
+* [Model](#model)
+* [Supported Languages](#supported-languages)
+* [Installation](#installation)
+* [Bugs and Feature Requests](#bugs-and-feature-requests)
+* [Credits](#credits)
+
+### Model
+
+CLD3 is a neural network model for language identification. This package
+ contains the inference code and a trained model. The inference code
+ extracts character ngrams from the input text and computes the fraction
+ of times each of them appears. For example, as shown in the figure below,
+ if the input text is "banana", then one of the extracted trigrams is "ana"
+ and the corresponding fraction is 2/4. The ngrams are hashed down to an id
+ within a small range, and each id is represented by a dense embedding vector
+ estimated during training.
+
+The model averages the embeddings corresponding to each ngram type according
+ to the fractions, and the averaged embeddings are concatenated to produce
+ the embedding layer. The remaining components of the network are a hidden
+ (Rectified linear) layer and a softmax layer.
+
+To get a language prediction for the input text, we simply perform a forward
+ pass through the network.
+
+![Figure](model.png "CLD3")
+
+### Supported Languages
+
+The model outputs BCP-47-style language codes, shown in the table below. For
+some languages, output is differentiated by script. Language and script names
+from
+[Unicode CLDR](https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en).
+
+Output Code | Language Name   | Script Name
+----------- | --------------- | ------------------------------------------
+af          | Afrikaans       | Latin
+am          | Amharic         | Ethiopic
+ar          | Arabic          | Arabic
+bg          | Bulgarian       | Cyrillic
+bg-Latn     | Bulgarian       | Latin
+bn          | Bangla          | Bangla
+bs          | Bosnian         | Latin
+ca          | Catalan         | Latin
+ceb         | Cebuano         | Latin
+co          | Corsican        | Latin
+cs          | Czech           | Latin
+cy          | Welsh           | Latin
+da          | Danish          | Latin
+de          | German          | Latin
+el          | Greek           | Greek
+el-Latn     | Greek           | Latin
+en          | English         | Latin
+eo          | Esperanto       | Latin
+es          | Spanish         | Latin
+et          | Estonian        | Latin
+eu          | Basque          | Latin
+fa          | Persian         | Arabic
+fi          | Finnish         | Latin
+fil         | Filipino        | Latin
+fr          | French          | Latin
+fy          | Western Frisian | Latin
+ga          | Irish           | Latin
+gd          | Scottish Gaelic | Latin
+gl          | Galician        | Latin
+gu          | Gujarati        | Gujarati
+ha          | Hausa           | Latin
+haw         | Hawaiian        | Latin
+hi          | Hindi           | Devanagari
+hi-Latn     | Hindi           | Latin
+hmn         | Hmong           | Latin
+hr          | Croatian        | Latin
+ht          | Haitian Creole  | Latin
+hu          | Hungarian       | Latin
+hy          | Armenian        | Armenian
+id          | Indonesian      | Latin
+ig          | Igbo            | Latin
+is          | Icelandic       | Latin
+it          | Italian         | Latin
+iw          | Hebrew          | Hebrew
+ja          | Japanese        | Japanese
+ja-Latn     | Japanese        | Latin
+jv          | Javanese        | Latin
+ka          | Georgian        | Georgian
+kk          | Kazakh          | Cyrillic
+km          | Khmer           | Khmer
+kn          | Kannada         | Kannada
+ko          | Korean          | Korean
+ku          | Kurdish         | Latin
+ky          | Kyrgyz          | Cyrillic
+la          | Latin           | Latin
+lb          | Luxembourgish   | Latin
+lo          | Lao             | Lao
+lt          | Lithuanian      | Latin
+lv          | Latvian         | Latin
+mg          | Malagasy        | Latin
+mi          | Maori           | Latin
+mk          | Macedonian      | Cyrillic
+ml          | Malayalam       | Malayalam
+mn          | Mongolian       | Cyrillic
+mr          | Marathi         | Devanagari
+ms          | Malay           | Latin
+mt          | Maltese         | Latin
+my          | Burmese         | Myanmar
+ne          | Nepali          | Devanagari
+nl          | Dutch           | Latin
+no          | Norwegian       | Latin
+ny          | Nyanja          | Latin
+pa          | Punjabi         | Gurmukhi
+pl          | Polish          | Latin
+ps          | Pashto          | Arabic
+pt          | Portuguese      | Latin
+ro          | Romanian        | Latin
+ru          | Russian         | Cyrillic
+ru-Latn     | Russian         | English
+sd          | Sindhi          | Arabic
+si          | Sinhala         | Sinhala
+sk          | Slovak          | Latin
+sl          | Slovenian       | Latin
+sm          | Samoan          | Latin
+sn          | Shona           | Latin
+so          | Somali          | Latin
+sq          | Albanian        | Latin
+sr          | Serbian         | Cyrillic
+st          | Southern Sotho  | Latin
+su          | Sundanese       | Latin
+sv          | Swedish         | Latin
+sw          | Swahili         | Latin
+ta          | Tamil           | Tamil
+te          | Telugu          | Telugu
+tg          | Tajik           | Cyrillic
+th          | Thai            | Thai
+tr          | Turkish         | Latin
+uk          | Ukrainian       | Cyrillic
+ur          | Urdu            | Arabic
+uz          | Uzbek           | Latin
+vi          | Vietnamese      | Latin
+xh          | Xhosa           | Latin
+yi          | Yiddish         | Hebrew
+yo          | Yoruba          | Latin
+zh          | Chinese         | Han (including Simplified and Traditional)
+zh-Latn     | Chinese         | Latin
+zu          | Zulu            | Latin
+
+### Installation
+CLD3 is designed to run in the Chrome browser, so it relies on code in
+[Chromium](http://www.chromium.org/).
+The steps for building and running the demo of the language detection model are:
+
+- [check out](http://www.chromium.org/developers/how-tos/get-the-code) the
+  Chromium repository.
+- copy the code to `//third_party/cld_3`
+- Uncomment `language_identifier_main` executable in `src/BUILD.gn`.
+- build and run the model using the commands:
+
+```shell
+gn gen out/Default
+ninja -C out/Default third_party/cld_3/src/src:language_identifier_main
+out/Default/language_identifier_main
+```
+### Bugs and Feature Requests
+
+Open a [GitHub issue](https://github.com/google/cld3/issues) for this repository to file bugs and feature requests.
+
+### Announcements and Discussion
+
+For announcements regarding major updates as well as general discussion list, please subscribe to:
+[cld3-users@googlegroups.com](https://groups.google.com/forum/#!forum/cld3-users)
+
+### Credits
+
+Original authors of the code in this package include (in alphabetical order):
+
+* Alex Salcianu
+* Andy Golding
+* Anton Bakalov
+* Chris Alberti
+* Daniel Andor
+* David Weiss
+* Emily Pitler
+* Greg Coppola
+* Jason Riesa
+* Kuzman Ganchev
+* Michael Ringgaard
+* Nan Hua
+* Ryan McDonald
+* Slav Petrov
+* Stefan Istrate
+* Terry Koo