LLM Tokeniser

Week 13, 2026

Python - Rules as Functions | BMC | Python Solutions

mainText = "Let me give you one piece of advice. Be honest. He knows more than you can imagine. At last. Welcome, Neo. As you no doubt have guessed I am Morpheus. It's an honor to meet you. No the honor is mine. Please, come. Sit. I imagine that right now you're feeling a bit like Alice tumbling down the rabbit hole? You could say that. I can see it in your eyes. You have the look of a man who accepts what he sees because he's expecting to wake up. Ironically, this is not far from the truth. Do you believe in fate, Neo? No. Why not? I don't like the idea that I'm not in control of my life. I know exactly what you mean. Let me tell you why you're here. You know something. What you know, you can't explain. But you feel it. You felt it your entire life: Something's wrong with the world. You don't know what, but it's there. Like a splinter in your mind driving you mad. It is this feeling that has brought you to me. Do you know what I'm talking about? The Matrix? Do you want to know what it is? The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work when you go to church when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. What truth? That you are a slave. Like everyone else, you were born into bondage born into a prison that you cannot smell or taste or touch. A prison for your mind. Unfortunately, no one can be told what the Matrix is. You have to see it for yourself. This is your last chance. After this, there is no turning back. You take the blue pill the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill you stay in Wonderland and I show you how deep the rabbit hole goes. Remember all I'm offering is the truth. Nothing more. Follow me. Apoc, are we on-line? Almost. Time is always against us. Please take a seat there. - You did all this? - Mm-hm. The pill you took is part of a trace program. It disrupts your carrier signals so we can pinpoint your location. What does that mean? It means buckle your seat belt, Dorothy because Kansas is going bye-bye. Did you ? Have you ever had a dream, Neo, that you were so sure was real? What if you were unable to wake from that dream? How would you know the difference between the dream world and the real world? This is the Construct. It's our loading program. We can load anything, from clothing to equipment weapons training simulations anything we need. Right now we're inside a computer program? Is it really so hard to believe? Your clothes are different. The plugs in your body are gone. Your hair has changed. Your appearance now is what we call 'residual self-image.' It is the mental projection of your digital self. This isn't real? What is 'real'? How do you define 'real'? If you're talking about what you can feel what you can smell, taste and see then 'real' is simply electrical signals interpreted by your brain. This is the world that you know. The world as it was at the end of the 20th century. It exists now only as part of a neural-interactive simulation that we call the Matrix. You've been living in a dream world, Neo. This is the world as it exists today. Welcome to 'the desert of the real.' We have only bits and pieces of information. But what we know for certain is that in the early 21st century all of mankind was united in celebration. We marveled at our own magnificence as we gave birth to AI. Al. You mean artificial intelligence. A singular consciousness that spawned an entire race of machines. We don't know who struck first, us or them. But we know that it was us that scorched the sky. They were dependent on solar power and it was believed that they would be unable to survive without an energy source as abundant as the sun. Throughout human history, we have been dependent on machines to survive. Fate, it seems, is not without a sense of irony. The human body generates more bioelectricity than a 120-voIt battery. And over 25,000 BTUs of body heat. Combined with a form of fusion the machines had found all the energy they would ever need. There are fields, Neo, endless fields where human beings are no longer born. We are grown. For the longest time, I wouldn't believe it. And then I saw the fields with my own eyes watched them liquefy the dead so they could be fed intravenously to the living. And standing there, facing the pure, horrifying precision I came to realize the obviousness of the truth. What is the Matrix?"


def rule1(longText) -> list[str]:
    '''
    [Rule-1] Spaces - these should not be included in the final output
    '''
    return longText.split(" ")


def rule2_3(aWord) -> list[str]:
    '''
    [Rule-2, 3]
    Prefixes: ["un", "re", "in", "dis", "pre", "mis", "ex", "anti", "inter", "sub", "over", "trans", "auto", "semi", "hyper", "multi", "co", "under", "il", "im"]
    Suffixes: ["ed", "ing", "ly", "ful", "less", "ness", "able", "ible", "ment", "ion", "er", "or", "ist", "ism", "ity", "ize", "ous", "ic", "al", "hood", "ship", "age"]
    ^ split on these as well
    '''
    tempAns = []
    pFix = False
    word_lower = aWord.lower()

    # PREFIX (matched case-insensitively, sliced from the original word)
    for prefix in PREFIXES:
        if word_lower.startswith(prefix):
            pFix = True
            real_prefix = aWord[:len(prefix)]
            break

    # Apply prefix first
    if pFix:
        tempAns.append(real_prefix)
        aWord = aWord[len(real_prefix):]
        word_lower = aWord.lower()

    # SUFFIX (now based on updated word)
    sFix = False
    for suffix in SUFFIXES:
        if word_lower.endswith(suffix):
            sFix = True
            real_suffix = aWord[-len(suffix):]
            break

    if not(pFix) and not(sFix):
        return [aWord]

    # Apply suffix
    if sFix:
        core = aWord[:-len(real_suffix)]
        if core != "":
            tempAns.append(core)
        tempAns.append(real_suffix)
    else:
        if aWord != "":
            tempAns.append(aWord)

    return tempAns


def rule4(aWord) -> list[str]:
    '''
    [Rule-4] Numbers - e.g. "T800" will be split as "T", "800"
    ^ split on these as well
    '''
    tempAns = []
    tempWordA = ""  # current run of non-digit characters
    tempWordN = ""  # current run of digits
    lenWord = len(aWord)
    numCheck = False

    # ensure that the word contains at least 1 digit
    for char in aWord:
        if char.isdigit():
            numCheck = True
            break

    if not numCheck:
        return [aWord]

    # otherwise...
    for i in range(lenWord):
        if aWord[i].isdigit():
            if tempWordA != "":
                tempAns.append(tempWordA)
                tempWordA = ""
            tempWordN += aWord[i]
        else:
            if tempWordN != "":
                tempAns.append(tempWordN)
                tempWordN = ""
            tempWordA += aWord[i]

    # flush whichever run is still open
    if tempWordN != "":
        tempAns.append(tempWordN)
    if tempWordA != "":
        tempAns.append(tempWordA)

    return tempAns


def rule5(aWord) -> list[str]:
    '''
    [Rule-5] Non-Word Characters - "!@#" will be split "!", "@", "#"
    ^ split on these as well
    '''
    tempAns = []
    tempWord = ""
    lenWord = len(aWord)
    nwcCheck = False

    # ensure that the word contains at least 1 NWCHAR
    for nwc in NWCHARS:
        if nwc in aWord:
            nwcCheck = True
            break

    if not nwcCheck:
        return [aWord]

    # otherwise...
    for i in range(lenWord):
        if aWord[i] not in NWCHARS:
            tempWord += aWord[i]
        else:
            if tempWord != "":
                tempAns.append(tempWord)
            tempAns.append(aWord[i])
            tempWord = ""

    if tempWord != "":
        tempAns.append(tempWord)

    return tempAns


def writeFile(aList):
    # one token per line; the with-block closes the file even on error
    with open("ans.txt", 'w') as file:
        for line in aList:
            file.write(line+'\n')


# main
# sorted based on length (longest first) as per rule 4
PREFIXES = ['inter', 'trans', 'hyper', 'multi', 'under', 'anti', 'over', 'auto', 'semi', 'dis', 'pre', 'mis', 'sub', 'un', 're', 'in', 'ex', 'co', 'il', 'im']
SUFFIXES = ['less', 'ness', 'able', 'ible', 'ment', 'hood', 'ship', 'ing', 'ful', 'ion', 'ist', 'ism', 'ity', 'ize', 'ous', 'age', 'ed', 'ly', 'er', 'or', 'ic', 'al']
NWCHARS = ["!", "@", "#", "-", '.', "'", ',', "$", ":", "?"]

ans = []
counter = 0  # empty strings produced by consecutive spaces

for word in rule1(mainText):
    for word2 in rule5(word):
        for word3 in rule4(word2):
            for word4 in rule2_3(word3):
                if word4 != "":
                    ans.append(word4)
                else:
                    counter += 1

writeFile(ans)
print(f"Total tokens: {len(ans)} | white spaces: {counter}")
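The five rules above can also be condensed into one regular-expression pass followed by a single affix split. The following is a minimal sketch of that idea, not the author's method: `tokenise`, `split_affixes` and `PIECE` are hypothetical names, and it assumes that every non-alphanumeric, non-space character counts as punctuation, which is broader than the `NWCHARS` list used above.

```python
import re

# Longest affixes first, so "inter" is tried before "in".
PREFIXES = sorted(["un", "re", "in", "dis", "pre", "mis", "ex", "anti", "inter",
                   "sub", "over", "trans", "auto", "semi", "hyper", "multi",
                   "co", "under", "il", "im"], key=len, reverse=True)
SUFFIXES = sorted(["ed", "ing", "ly", "ful", "less", "ness", "able", "ible",
                   "ment", "ion", "er", "or", "ist", "ism", "ity", "ize",
                   "ous", "ic", "al", "hood", "ship", "age"], key=len, reverse=True)

# A run of letters, a run of digits, or one punctuation character.
# Whitespace matches nothing, so it never appears in the output.
PIECE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


def split_affixes(word: str) -> list[str]:
    """Split off at most one prefix, then at most one suffix."""
    parts = []
    lower = word.lower()
    for p in PREFIXES:
        if lower.startswith(p):
            parts.append(word[:len(p)])
            word, lower = word[len(p):], lower[len(p):]
            break
    for s in SUFFIXES:
        if lower.endswith(s):
            core = word[:-len(s)]
            if core:
                parts.append(core)
            parts.append(word[-len(s):])
            return parts
    if word:
        parts.append(word)
    return parts


def tokenise(text: str) -> list[str]:
    tokens = []
    for piece in PIECE.findall(text):
        if piece.isalpha():
            tokens.extend(split_affixes(piece))
        else:
            tokens.append(piece)  # digits and punctuation are kept whole
    return tokens


print(tokenise("I'll be back - T800"))
# ['I', "'", 'll', 'be', 'back', '-', 'T', '800']
```

Because the regular expression classifies characters in a single pass, the order in which the space, punctuation and number rules are applied no longer matters; only the affix step still depends on the longest-first ordering.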

Token iterator and prefix/suffix matching | greenya | Odin Solutions

package main

import "core:fmt"
import "core:slice"
import "core:strings"

TEXT := #load("input.txt", string) or_else "I'll be back - T800"

main :: proc () {
    prefixes := [?] string { "un", "re", "in", "dis", "pre", "mis", "ex", "anti", "inter", "sub", "over", "trans", "auto", "semi", "hyper", "multi", "co", "under", "il", "im" }
    suffixes := [?] string { "ed", "ing", "ly", "ful", "less", "ness", "able", "ible", "ment", "ion", "er", "or", "ist", "ism", "ity", "ize", "ous", "ic", "al", "hood", "ship", "age" }

    slice.sort_by(prefixes[:], cmp_len_abc)
    slice.sort_by(suffixes[:], cmp_len_abc)

    text_lc := strings.to_lower(TEXT) // for prefix/suffix matching

    it := token_iterate(TEXT)
    for type, token, start, end in token_next(&it) {
        if type != .abc {
            fmt.println(token)
            continue
        }

        prefix := ""
        middle_oc := token
        middle_lc := text_lc[start:end]
        suffix := ""

        for p in prefixes do if strings.starts_with(middle_lc, p) {
            prefix = middle_oc[:len(p)]
            middle_oc = middle_oc[len(p):]
            middle_lc = middle_lc[len(p):]
            break
        }

        for s in suffixes do if strings.ends_with(middle_lc, s) {
            suffix = middle_oc[len(middle_oc)-len(s):]
            middle_oc = middle_oc[:len(middle_oc)-len(s)]
            break
        }

        if prefix != "" do fmt.println(prefix)
        if middle_oc != "" do fmt.println(middle_oc)
        if suffix != "" do fmt.println(suffix)
    }
}

Token_Type :: enum { none, pun, abc, num }

Token_Iterator :: struct {
    text: string,
    index: int,
}

token_iterate :: proc (text: string) -> Token_Iterator {
    return { text=text }
}

token_next :: proc (it: ^Token_Iterator) -> (type: Token_Type, token: string, start, end: int, ok: bool) {
    seq: struct {
        type: Token_Type,
        start, end: int,
    }

    loop: for ; it.index < len(it.text); it.index += 1 {
        i := it.index
        c := it.text[i]
        switch {
        case is_sep(c):
            if seq.type != .none {
                seq.end = i
                break loop
            }
        case is_pun(c):
            if seq.type == .none {
                seq = { type=.pun, start=i, end=i+1 }
            } else {
                seq.end = i
            }
            break loop
        case is_abc(c), is_num(c):
            c_type: Token_Type = is_abc(c) ? .abc : .num
            if seq.type != c_type {
                if seq.type == .none {
                    seq = { type=c_type, start=i }
                } else {
                    seq.end = i
                    break loop
                }
            }
        }
    }

    if seq.type != .none {
        if seq.end == 0 {
            assert(it.index == len(it.text))
            seq.end = it.index
        }
        if seq.end != 0 {
            it.index = seq.end
            return seq.type, it.text[seq.start:seq.end], seq.start, seq.end, true
        }
    }

    return
}

cmp_len_abc :: proc (a, b: string) -> bool {
    return len(a) == len(b) ? a < b : len(a) > len(b)
}

is_num :: proc (c: byte) -> bool {
    return c>='0' && c<='9'
}

is_abc :: proc (c: byte) -> bool {
    return (c>='A' && c<='Z') || (c>='a' && c<='z')
}

is_sep :: proc (c: byte) -> bool {
    return c==' ' || c=='\t' || c=='\n' || c=='\a' || c=='\r'
}

is_pun :: proc (c: byte) -> bool {
    // https://web.alfredstate.edu/faculty/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.gif
    return (c>=33 && c<=47) || (c>=58 && c<=64) || (c>=91 && c<=96) || (c>=123 && c<=126)
}
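The `cmp_len_abc` comparator above orders affixes longest-first and alphabetically within the same length. The Python sketch below shows why that ordering matters when only the first matching prefix is taken. It is an illustration only: `first_match` and `by_length` are hypothetical names, and the sort key `(-len(p), p)` is assumed to be an equivalent of the Odin comparator.

```python
# Prefix list in the order given by the challenge (unsorted).
PREFIXES = ["un", "re", "in", "dis", "pre", "mis", "ex", "anti", "inter",
            "sub", "over", "trans", "auto", "semi", "hyper", "multi",
            "co", "under", "il", "im"]


def first_match(word: str, prefixes: list[str]) -> str:
    """Return the first prefix in list order that the word starts with, or ''."""
    lower = word.lower()
    for p in prefixes:
        if lower.startswith(p):
            return p
    return ""


# Longest first, then alphabetical: the same ordering as cmp_len_abc.
by_length = sorted(PREFIXES, key=lambda p: (-len(p), p))

print(first_match("interact", PREFIXES))    # in
print(first_match("interact", by_length))   # inter
print(first_match("Understand", PREFIXES))  # un
print(first_match("Understand", by_length)) # under
```

With the unsorted list, a short prefix such as "in" shadows the longer "inter"; sorting once up front avoids having to search for the longest match on every token.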

Python Step by Step | Alexevi | Python Solutions

def main() -> None:
    prefixes = ["inter", "under", "multi", "hyper", "trans", "over", "auto", "semi", "anti", "dis", "pre", "mis", "sub", "un", "re", "in", "ex", "co", "il", "im"]
    suffixes = ["less", "ness", "able", "ible", "ment", "hood", "ship", "ing", "ful", "ion", "ist", "ism", "ity", "ize", "ous", "age", "ed", "ly", "er", "or", "ic", "al"]
    specific = ['`', '~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-', '_', '+', '=', '[', ']', '{', '}', '\\', '|', '/', '"', '\'', ':', ';', '>', '<', ',', '.', '?']

    text = "Example Text 123"
    result = []

    for el in text.split(' '):
        out: list[str] = []
        d = el.lower()

        # Step 1: split on special characters, keeping each one as its own piece
        j = 0
        for i in range(len(d)):
            if d[i] in specific:
                out.append(el[j:i])
                out.append(d[i])
                j = i+1
        if j != len(d):
            out.append(el[j:])

        # Step 2: split digits away from the surrounding characters
        for i in range(len(out)):
            l = len(out[i])
            index = -1  # start of the digit run, -1 while none has been seen
            for j in range(l):
                if out[i][j].isdecimal() and index == -1:
                    index = j
                elif index != -1 and not out[i][j].isdecimal():
                    out.insert(i+1, out[i][index:j])
                    out.insert(i+2, out[i][j:])
                    out[i] = out[i][:index]
                    break
                elif index != -1 and j == l-1:
                    out.insert(i+1, out[i][index:])
                    out[i] = out[i][:index]
                    break

        # Step 3: split prefixes and suffixes; shift tracks pieces inserted so far
        shift = 0
        for i in range(len(out)):
            skip = False
            if len(out[i+shift]) > 1 and not out[i+shift].isdecimal():
                for pref in prefixes:
                    a = out[i+shift][:len(pref)]
                    if out[i+shift].lower() == pref:
                        skip = True
                        break
                    if out[i+shift][:len(pref)].lower() == pref:
                        out.insert(i+shift+1, out[i+shift][len(pref):])
                        out[i+shift] = a
                        shift += 1
                        break
                if skip:
                    # the piece is exactly a prefix: keep it whole and move on
                    # (was `break`, which left the remaining pieces unprocessed)
                    continue
                for suff in suffixes:
                    a = out[i+shift][-len(suff):]
                    if out[i+shift].lower() == suff:
                        break
                    if out[i+shift][-len(suff):].lower() == suff:
                        out.insert(i+shift+1, a)
                        out[i+shift] = out[i+shift][:-len(suff)]
                        shift += 1
                        break

        # Step 4: drop the empty pieces left over from the splits
        for i in out:
            if i != '':
                result.append(i)

    for i in result:
        print(i)


main()
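The digit-splitting step can also be expressed with `itertools.groupby`, which groups consecutive characters that share the same `str.isdecimal` result. This is an alternative sketch rather than the author's approach; `split_digits` is a hypothetical helper name.

```python
from itertools import groupby


def split_digits(piece: str) -> list[str]:
    """Split a piece into alternating runs of digits and non-digits."""
    return ["".join(run) for _, run in groupby(piece, key=str.isdecimal)]


print(split_digits("T800"))    # ['T', '800']
print(split_digits("T800X9"))  # ['T', '800', 'X', '9']
print(split_digits("Text"))    # ['Text']
```

Each run is emitted in order, so a piece with several digit runs is split at every boundary and no empty strings need to be filtered out afterwards.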