@aardrian @odddev @thilo @aral
My approach would be this:
If any non-Latin characters are present, tokenise the text. For each non-Latin token, use a pre-defined hash table to rewrite each symbol to its Latin equivalent (if one exists). If the result is a purely Latin token, lemmatise it to determine whether it's an existing word in the post's language. If so, read the natural word instead of the non-Latin token.
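A minimal Python sketch of that idea — the tiny confusables table and word set here are stand-ins for the real pre-defined hash table (e.g. something derived from Unicode's confusables data) and for a proper lemmatiser plus dictionary:

```python
# Hand-rolled stand-in for the pre-defined symbol-to-Latin hash table.
# A real table would be far larger (mathematical alphanumerics,
# small caps, fullwidth forms, etc.).
CONFUSABLES = {
    "𝗵": "h", "𝗲": "e", "𝗹": "l", "𝗼": "o",  # mathematical sans-serif bold
}

# Hypothetical stand-in for lemmatisation + dictionary lookup
# in the post's language.
KNOWN_WORDS = {"hello", "world"}

def latinise(token):
    """Rewrite each symbol to its Latin equivalent.
    Returns None if any symbol has no mapping and is not already Latin."""
    out = []
    for ch in token:
        if ch in CONFUSABLES:
            out.append(CONFUSABLES[ch])
        elif ch.isascii() and ch.isalpha():
            out.append(ch.lower())
        else:
            return None
    return "".join(out)

def readable(token):
    """Return the natural word for a non-Latin token when the rewrite
    yields a known word; otherwise leave the token untouched."""
    if token.isascii():
        return token  # purely Latin token: nothing to do
    candidate = latinise(token)
    if candidate is not None and candidate in KNOWN_WORDS:
        return candidate  # read the natural word instead
    return token

print(readable("𝗵𝗲𝗹𝗹𝗼"))  # hello
```

In practice the dictionary check is the important guard: it stops the rewrite from mangling tokens that merely contain decorative symbols without spelling a real word.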