@trochee@dair-community.social @skinnylatte@hachyderm.io (I'm not a polyglot, but thankfully I'm not strictly a monoglot, either) I'm pretty much going with the idea that most assumptions I would make about language based on my knowledge aren't going to hold up, especially in languages that aren't as widely spoken or read, which is sort of where I would want to pay special attention.
Hmmmm. I wonder if there's a language that is both A. "underserved" by technical tools and B. rather difficult to tokenize? Sounds like a number of languages already fill condition B... and probably fill condition A.
Is there a better... mmm, model, either in the computational sense or otherwise, with which to approach breaking up the text? Or could tokenization work "in theory", and it's just that not enough work has been done yet?
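(Not part of the original post, but a minimal sketch of one thing "difficult to tokenize" can mean in practice: naive whitespace tokenization works passably for English and falls flat for languages that don't put spaces between words. The Japanese and Thai example sentences are my own illustrations, not from the post.)

```python
# Minimal sketch: why whitespace-based tokenization breaks down for
# languages that don't mark word boundaries with spaces.

english = "the cat sat on the mat"
japanese = "猫がマットの上に座った"   # roughly "the cat sat on the mat"
thai = "แมวนั่งบนเสื่อ"               # same sentence in Thai

for text in (english, japanese, thai):
    tokens = text.split()  # naive whitespace tokenization
    print(len(tokens), tokens)

# english  -> 6 tokens, one per word
# japanese / thai -> 1 "token" each: the whole sentence, since there is
# no whitespace to split on. Doing better needs a dictionary/statistical
# segmenter or a learned subword model (BPE, unigram LM, etc.), and those
# are only as good as the training data available for the language,
# which is exactly the "underserved" problem.
```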
Asta [AMP] (aud@fire.asta.lgbt), Thursday, 03-Apr-2025 14:31:25 JST