code logs -> 2023 -> Sun, 10 Sep 2023< code.20230909.log - code.20230911.log >
--- Log opened Sun Sep 10 00:00:58 2023
00:01 Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [Ping timeout: 121 seconds]
00:55 Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed]
01:08 synapse [synapse@Nightstar-qfvjj5.mechanicus.xyz] has joined #code
01:08
< synapse>
hello! I have a code question.
01:09
< synapse>
I'm converting CC-CEDICT (a popular Chinese-English dictionary freely available) into a Chinese lexer
01:10
< synapse>
so basically, given a free-form text, identify positions of known words.
01:11 Kimo|autojoin is now known as Kindamoody
01:13
< synapse>
the way I thought I'd do it is by feeding my list of words into a prefix-trie so that I can take a series of Chinese characters and basically ask what's the longest prefix that maps to a dictionary entry. if/when I exceed the longest known entry, I've found the longest known word, and I'll use that (e.g. translate or use the pinyin, or whatever, depending on the use-case)
01:15
< synapse>
e.g. "你好朋友" would first have an entry for 你 (being a dictionary entry of its own), and then 你好 (also being a dictionary entry of its own), and then 你好朋 would fail because the dictionary contains no words that begin with this. this yields the dictionary entry for 你好, and I continue to lex at 朋友.
01:16
< synapse>
(I'm doing this in Rust, by the way.)
01:18
< synapse>
pub struct Dictionary { pub dict_entries: Vec<cedict::DictEntry<String>>, pub simplified_map: HashMap<String, usize>, }
01:19
< synapse>
the simplified_map is an example of how I overcome borrowing problems: it just maps a string containing a simplified chinese word (e.g. "你好".to_string()) to its dictionary entry, without duplicating the entries; it's just an index into the vector.
01:21
< synapse>
before I find a good prefix-trie, I think I can do with a prefix_map: HashMap<String, Vec<(usize, bool)>>, where 'String' is the prefix, 'usize' is the position in the dict_entries vector, and 'bool' indicates if this prefix is a complete word, or an incomplete prefix.
01:23
< synapse>
(In a previous attempt to do this in Haskell, I used a prefix-trie that branched 2^8 times per level, which was great for efficiency, but didn't differentiate between valid words and prefixes. using utf-8 now, I can't really branch 2^16 times per level. so I think I need to write my own prefix-trie that specifically lets me insert a full string and annotates down the tree if it's a partial or a complete
01:23
< synapse>
word.)
01:24
< synapse>
so now my question: does anyone know a prefix trie in Rust that does this?
01:25
< synapse>
I don't find a lot of very popular crates for this. but I also get a little confused about efficient branching factor vs. dealing correctly with utf-8 prefixes.
01:25
< synapse>
also, thanks for allowing monologues.
01:54
< ToxicFrog`>
I am not particularly conversant in Rust, but this does sound like an interesting problem.
01:54 ToxicFrog` is now known as ToxicFrog
01:58 Degi_ [Degi@Nightstar-b8v.8lk.13.77.IP] has joined #code
01:59 Degi [Degi@Nightstar-n0n.jgb.13.77.IP] has quit [Ping timeout: 121 seconds]
01:59 Degi_ is now known as Degi
02:10 ErikMesoy [Bruker@Nightstar-5n0.ghp.211.84.IP] has quit [Connection reset by peer]
02:55
< synapse>
I think my problem with the prefix trie that branches e.g. 256 times per layer is this: say I have 朋友, and I want to say "Yes, there is a word that begins with 朋, but 朋 isn't a word." -- but if the first byte of 朋 has a "this is a prefix but not a complete word"-node in the trie, that's not really the same.
02:57
< synapse>
so with a prefix-trie that branches 16 or 256 times (which I think are good choices performance-wise), I actually need three kinds of nodes: "this is just a prefix, but does not mark the end of a utf-8 char", "this is a prefix that marks the end of a utf-8 char, but not a complete word", and "this is actually a word"...
02:57
< synapse>
but I feel like I'm complicating this because my idea of how this is well-made is still growing very slowly with every sentence I make.
03:00
< synapse>
so... for 256-way branching, I actually know based on the branch number and the depth of the tree if it's a valid utf-8 character
03:00
< synapse>
(I could say the same for 16-way branching, but I'm trying to be concrete to simplify)
03:01
< synapse>
I think I want to try and express something called an utf-8 trie.
03:02
< synapse>
haha. the moment I know what to call it, I also know what to look for.
03:02
< synapse>
https://github.com/rust-lang/rust/pull/33098
03:04
< synapse>
https://docs.rs/unicode-id-start/latest/unicode_id_start/#ucd-trie
03:05
< synapse>
he uses a completely different representation than what I had in mind.
03:05
< synapse>
I probably want a more classical trie, since I'd really like to have an iterator over words with a given prefix.
03:15 * synapse is gonna go to bed, will look at this tomorrow.
03:24 panzerfaus [uid279771@Nightstar-l49s2t.irccloud.com] has quit [[NS] Quit: Connection closed for inactivity]
06:35 ErikMesoy [Bruker@Nightstar-5n0.ghp.211.84.IP] has joined #code
06:35 mode/#code [+o ErikMesoy] by ChanServ
10:05 Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has joined #code
11:03 Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [Ping timeout: 121 seconds]
14:57 Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has joined #code
14:57 mode/#code [+qo Vornicus Vornicus] by ChanServ
15:49
< abudhabi>
https://twitter.com/mathemagic1an/status/1700232760756207956
17:36 Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed]
19:45 abudhabi [abudhabi@Nightstar-854261.supernova.orange.pl] has quit [[NS] Quit: Leaving]
20:02 Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has joined #code
20:08 abudhabi [abudhabi@Nightstar-854261.supernova.orange.pl] has joined #code
20:20 Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [NickServ (RECOVER command used by M-E)]
20:20 Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has joined #code
22:52 Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [Ping timeout: 121 seconds]
--- Log closed Mon Sep 11 00:00:59 2023
code logs -> 2023 -> Sun, 10 Sep 2023< code.20230909.log - code.20230911.log >

[ Latest log file ]