--- Log opened Sun Sep 10 00:00:58 2023 |
00:01 | | Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [Ping timeout: 121 seconds] |
00:55 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed] |
01:08 | | synapse [synapse@Nightstar-qfvjj5.mechanicus.xyz] has joined #code |
01:08 | < synapse> | hello! I have a code question. |
01:09 | < synapse> | I'm converting CC-CEDICT (a popular Chinese-English dictionary freely available) into a Chinese lexer |
01:10 | < synapse> | so basically, given a free-form text, identify positions of known words. |
01:11 | | Kimo|autojoin is now known as Kindamoody |
01:13 | < synapse> | the way I thought I'd do it is by feeding my list of words into a prefix-trie so that I can take a series of Chinese characters and basically ask what's the longest prefix that maps to a dictionary entry. if/when I exceed the longest known entry, I've found the longest known word, and I'll use that (e.g. translate or use the pinyin, or whatever, depending on the use-case) |
01:15 | < synapse> | e.g. "你好朋友" would first have an entry for 你 (being a dictionary entry of its own), and then 你好 (also being a dictionary entry of its own), and then 你好朋 would fail because the dictionary contains no words that begin with this. this yields the dictionary entry for 你好, and I continue to lex at 朋友. |
01:16 | < synapse> | (I'm doing this in Rust, by the way.) |
01:18 | < synapse> | pub struct Dictionary { pub dict_entries: Vec<cedict::DictEntry<String>>, pub simplified_map: HashMap<String, usize>, } |
01:19 | < synapse> | the simplified_map is an example of how I overcome borrowing problems: it just maps a string containing a simplified chinese word (e.g. "你好".to_string()) to its dictionary entry, without duplicating the entries; it's just an index into the vector. |
01:21 | < synapse> | before I find a good prefix-trie, I think I can do with a prefix_map: HashMap<String, Vec<(usize, bool)>>, where 'String' is the prefix, 'usize' is the position in the dict_entries vector, and 'bool' indicates if this prefix is a complete word, or an incomplete prefix. |
01:23 | < synapse> | (In a previous attempt to do this in Haskell, I used a prefix-trie that branched 2^8 times per level, which was great for efficiency, but didn't differentiate between valid words and prefixes. using utf-8 now, I can't really branch 2^16 times per level. so I think I need to write my own prefix-trie that specifically lets me insert a full string and annotates down the tree if it's a partial or a complete |
01:23 | < synapse> | word.) |
01:24 | < synapse> | so now my question: does anyone know a prefix trie in Rust that does this? |
01:25 | < synapse> | I don't find a lot of very popular crates for this. but I also get a little confused about efficient branching factor vs. dealing correctly with utf-8 prefixes. |
01:25 | < synapse> | also, thanks for allowing monologues. |
01:54 | < ToxicFrog`> | I am not particularly conversant in Rust, but this does sound like an interesting problem. |
01:54 | | ToxicFrog` is now known as ToxicFrog |
01:58 | | Degi_ [Degi@Nightstar-b8v.8lk.13.77.IP] has joined #code |
01:59 | | Degi [Degi@Nightstar-n0n.jgb.13.77.IP] has quit [Ping timeout: 121 seconds] |
01:59 | | Degi_ is now known as Degi |
02:10 | | ErikMesoy [Bruker@Nightstar-5n0.ghp.211.84.IP] has quit [Connection reset by peer] |
02:55 | < synapse> | I think my problem with the prefix trie that branches e.g. 256 times per layer is this: say I have 朋友, and I want to say "Yes, there is a word that begins with 朋, but 朋 isn't a word." -- but if the first byte of 朋 has a "this is a prefix but not a complete word"-node in the trie, that's not really the same. |
02:57 | < synapse> | so with a prefix-trie that branches 16 or 256 times (which I think are good choices performance-wise), I actually need three kinds of nodes: "this is just a prefix, but does not mark the end of a utf-8 char", "this is a prefix that marks the end of a utf-8 char, but not a complete word", and "this is actually a word"... |
02:57 | < synapse> | but I feel like I'm complicating this because my idea of how this is well-made is still growing very slowly with every sentence I make. |
03:00 | < synapse> | so... for 256-way branching, I actually know based on the branch number and the depth of the tree if it's a valid utf-8 character |
03:00 | < synapse> | (I could say the same for 16-way branching, but I'm trying to be concrete to simplify) |
03:01 | < synapse> | I think I want to try and express something called an utf-8 trie. |
03:02 | < synapse> | haha. the moment I know what to call it, I also know what to look for. |
03:02 | < synapse> | https://github.com/rust-lang/rust/pull/33098 |
03:04 | < synapse> | https://docs.rs/unicode-id-start/latest/unicode_id_start/#ucd-trie |
03:05 | < synapse> | he uses a completely different representation than what I had in mind. |
03:05 | < synapse> | I probably want a more classical trie, since I'd really like to have an iterator over words with a given prefix. |
03:15 | | * synapse is gonna go to bed, will look at this tomorrow. |
03:24 | | panzerfaus [uid279771@Nightstar-l49s2t.irccloud.com] has quit [[NS] Quit: Connection closed for inactivity] |
06:35 | | ErikMesoy [Bruker@Nightstar-5n0.ghp.211.84.IP] has joined #code |
06:35 | | mode/#code [+o ErikMesoy] by ChanServ |
10:05 | | Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has joined #code |
11:03 | | Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [Ping timeout: 121 seconds] |
14:57 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has joined #code |
14:57 | | mode/#code [+qo Vornicus Vornicus] by ChanServ |
15:49 | < abudhabi> | https://twitter.com/mathemagic1an/status/1700232760756207956 |
17:36 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed] |
19:45 | | abudhabi [abudhabi@Nightstar-854261.supernova.orange.pl] has quit [[NS] Quit: Leaving] |
20:02 | | Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has joined #code |
20:08 | | abudhabi [abudhabi@Nightstar-854261.supernova.orange.pl] has joined #code |
20:20 | | Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [NickServ (RECOVER command used by M-E)] |
20:20 | | Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has joined #code |
22:52 | | Emmy [Emmy@Nightstar-qo29c7.fixed.kpn.net] has quit [Ping timeout: 121 seconds] |
--- Log closed Mon Sep 11 00:00:59 2023 |