[Python] Python - Rules as Functions
LLM Tokeniser
Week 13, 2026
mainText = (
    "Let me give you one piece of advice. Be honest. He knows more than you can imagine. At last. Welcome, Neo. As you no doubt have guessed I am Morpheus. It's an honor to meet you. No the honor is mine. Please, come. Sit. I imagine that right now you're feeling a bit like Alice tumbling down the rabbit hole? You could say that. I can see it in your eyes. You have the look of a man who accepts what he sees because he's expecting to wake up. Ironically, this is not far from the truth. Do you believe in fate, Neo? No. Why not? I don't like the idea that I'm not in control of my life. I know exactly what you mean. Let me tell you why you're here. You know something. What you know, you can't explain. But you feel it. You felt it your entire life: Something's wrong with the world. You don't know what, but it's there. Like a splinter in your mind driving you mad. It is this feeling that has brought you to me. Do you know what I'm talking about? The Matrix? Do you want to know what it is? The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work when you go to church when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. What truth? That you are a slave. Like everyone else, you were born into bondage born into a prison that you cannot smell or taste or touch. A prison for your mind. Unfortunately, no one can be told what the Matrix is. You have to see it for yourself. This is your last chance. After this, there is no turning back. You take the blue pill the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill you stay in Wonderland and I show you how deep the rabbit hole goes. Remember all I'm offering is the truth. Nothing more. Follow me. Apoc, are we on-line? Almost. Time is always against us. Please take a seat there. "
    "- You did all this? - Mm-hm. The pill you took is part of a trace program. It disrupts your carrier signals so we can pinpoint your location. What does that mean? It means buckle your seat belt, Dorothy because Kansas is going bye-bye. Did you ? Have you ever had a dream, Neo, that you were so sure was real? What if you were unable to wake from that dream? How would you know the difference between the dream world and the real world? This is the Construct. It's our loading program. We can load anything, from clothing to equipment weapons training simulations anything we need. Right now we're inside a computer program? Is it really so hard to believe? Your clothes are different. The plugs in your body are gone. Your hair has changed. Your appearance now is what we call 'residual self-image.' It is the mental projection of your digital self. This isn't real? What is 'real'? How do you define 'real'? If you're talking about what you can feel what you can smell, taste and see then 'real' is simply electrical signals interpreted by your brain. This is the world that you know. The world as it was at the end of the 20th century. It exists now only as part of a neural-interactive simulation that we call the Matrix. You've been living in a dream world, Neo. This is the world as it exists today. Welcome to 'the desert of the real.' We have only bits and pieces of information. But what we know for certain is that in the early 21st century all of mankind was united in celebration. We marveled at our own magnificence as we gave birth to AI. Al. You mean artificial intelligence. A singular consciousness that spawned an entire race of machines. We don't know who struck first, us or them. But we know that it was us that scorched the sky. They were dependent on solar power and it was believed that they would be unable to survive without an energy source as abundant as the sun. Throughout human history, we have been dependent on machines to survive. "
    "Fate, it seems, is not without a sense of irony. The human body generates more bioelectricity than a 120-voIt battery. And over 25,000 BTUs of body heat. Combined with a form of fusion the machines had found all the energy they would ever need. There are fields, Neo, endless fields where human beings are no longer born. We are grown. For the longest time, I wouldn't believe it. And then I saw the fields with my own eyes watched them liquefy the dead so they could be fed intravenously to the living. And standing there, facing the pure, horrifying precision I came to realize the obviousness of the truth. What is the Matrix?"
)
def rule1(longText) -> list[str]:
    '''
    [Rule-1]
    Spaces - these should not be included in the final output
    '''
    return longText.split(" ")
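# Illustration (a sketch, not part of the original solution): splitting on a
# single space yields an empty string for every extra space, and those empty
# strings are what the "white spaces" counter in the main loop ends up
# counting. The helper name count_empty_tokens is hypothetical.
def count_empty_tokens(text: str) -> int:
    # "a  b".split(" ") gives ["a", "", "b"], so one empty piece is counted
    return sum(1 for piece in text.split(" ") if piece == "")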
def rule2_3(aWord) -> list[str]:
    '''
    [Rule-2, 3]
    Prefixes: ["un", "re", "in", "dis", "pre", "mis", "ex", "anti", "inter", "sub", "over", "trans", "auto", "semi", "hyper", "multi", "co", "under", "il", "im"]
    Suffixes: ["ed", "ing", "ly", "ful", "less", "ness", "able", "ible", "ment", "ion", "er", "or", "ist", "ism", "ity", "ize", "ous", "ic", "al", "hood", "ship", "age"]
    ^ split on these as well
    '''
    global PREFIXES, SUFFIXES
    tempAns = []
    pFix = False
    word_lower = aWord.lower()
    # PREFIX
    for prefix in PREFIXES:
        if word_lower.startswith(prefix):
            pFix = True
            real_prefix = aWord[:len(prefix)]
            break
    # Apply prefix first
    if pFix:
        tempAns.append(real_prefix)
        aWord = aWord[len(real_prefix):]
        word_lower = aWord.lower()
    # SUFFIX (now based on updated word)
    sFix = False
    for suffix in SUFFIXES:
        if word_lower.endswith(suffix):
            sFix = True
            real_suffix = aWord[-len(suffix):]
            break
    if not(pFix) and not(sFix):
        return [aWord]
    # Apply suffix
    if sFix:
        core = aWord[:-len(real_suffix)]
        if core != "":
            tempAns.append(core)
        tempAns.append(real_suffix)
    else:
        if aWord != "":
            tempAns.append(aWord)
    return tempAns
def rule4(aWord) -> list[str]:
    '''
    [Rule-4]
    Numbers - e.g. "T800" will be split as "T", "800"
    ^ split on these as well
    '''
    tempAns = []
    tempWordA = ""  # current run of non-digit characters
    tempWordN = ""  # current run of digits
    lenWord = len(aWord)
    numCheck = False
    # ensure that the word contains at least 1 digit
    for char in aWord:
        if char.isdigit():
            numCheck = True
            break
    if not numCheck: return [aWord]
    # otherwise...
    for i in range(lenWord):
        if aWord[i].isdigit():
            # a digit ends any non-digit run in progress
            if tempWordA != "":
                tempAns.append(tempWordA)
                tempWordA = ""
            tempWordN += aWord[i]
        else:
            # a non-digit ends any digit run in progress
            if tempWordN != "":
                tempAns.append(tempWordN)
                tempWordN = ""
            tempWordA += aWord[i]
    # flush whichever run is still open
    if tempWordN != "": tempAns.append(tempWordN)
    if tempWordA != "": tempAns.append(tempWordA)
    return tempAns
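# Alternative sketch (an assumption, not the original approach): the same
# digit / non-digit split as rule4 can be written with the standard re module.
# The name rule4_regex is hypothetical. Note that \d and str.isdigit() can
# disagree on a few Unicode characters such as superscript digits; for ASCII
# input they should behave the same.
import re

def rule4_regex(aWord: str) -> list[str]:
    # \d+ grabs a run of digits, \D+ a run of anything else; findall returns
    # the runs in order. An empty word falls back to [aWord], as rule4 does.
    return re.findall(r"\d+|\D+", aWord) or [aWord]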
def rule5(aWord) -> list[str]:
    '''
    [Rule-5]
    Non-Word Characters - "!@#" will be split "!", "@", "#"
    ^ split on these as well
    '''
    global NWCHARS
    tempAns = []
    tempWord = ""
    lenWord = len(aWord)
    nwcCheck = False
    # ensure that the word contains at least 1 NWCHAR
    for nwc in NWCHARS:
        if nwc in aWord:
            nwcCheck = True
            break
    if not nwcCheck: return [aWord]
    # otherwise...
    for i in range(lenWord):
        if aWord[i] not in NWCHARS:
            tempWord += aWord[i]
        else:
            if tempWord != "": tempAns.append(tempWord)
            tempAns.append(aWord[i])
            tempWord = ""
    if tempWord != "": tempAns.append(tempWord)
    return tempAns
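# Alternative sketch (an assumption, not the original approach): rule5's split
# on non-word characters expressed as one regular expression. The names
# rule5_regex, _NWCHARS_SKETCH, _NWC_CLASS and _NWC_PATTERN are hypothetical;
# the character list is an assumed copy of NWCHARS so the sketch stands alone.
import re

_NWCHARS_SKETCH = ["!", "@", "#", "-", ".", "'", ",", "$", ":", "?"]
# re.escape keeps characters such as "-" and "." literal inside the class
_NWC_CLASS = re.escape("".join(_NWCHARS_SKETCH))
# either exactly one non-word character, or a run of everything else
_NWC_PATTERN = re.compile(f"[{_NWC_CLASS}]|[^{_NWC_CLASS}]+")

def rule5_regex(aWord: str) -> list[str]:
    # An empty word falls back to [aWord], as rule5 does.
    return _NWC_PATTERN.findall(aWord) or [aWord]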
def writeFile(aList):
    # the with-block closes the file even if a write fails
    with open("ans.txt", 'w') as file:
        for line in aList:
            file.write(line+'\n')
# main
# sorted by length, longest first, as per rule 4, so the longest matching affix is tried first
PREFIXES = ['inter', 'trans', 'hyper', 'multi', 'under', 'anti', 'over', 'auto', 'semi', 'dis', 'pre', 'mis', 'sub', 'un', 're', 'in', 'ex', 'co', 'il', 'im']
SUFFIXES = ['less', 'ness', 'able', 'ible', 'ment', 'hood', 'ship', 'ing', 'ful', 'ion', 'ist', 'ism', 'ity', 'ize', 'ous', 'age', 'ed', 'ly', 'er', 'or', 'ic', 'al']
NWCHARS = ["!", "@", "#", "-", '.', "'", ',', "$", ":", "?"]
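# Illustration (a sketch under assumptions, not part of the original solution):
# why the affix lists are ordered longest-first. rule2_3 stops at the first
# prefix that matches, so list order decides between e.g. "in" and "inter".
# The helper name first_matching_prefix is hypothetical.
def first_matching_prefix(word: str, prefixes: list[str]):
    # Return the first prefix, in list order, that the word starts with
    for prefix in prefixes:
        if word.lower().startswith(prefix):
            return prefix
    return None

# Usage: first_matching_prefix("internet", ["in", "inter"]) picks "in", while
# the same list passed through sorted(..., key=len, reverse=True) picks "inter".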
ans = []
counter = 0
for word in rule1(mainText):
    for word2 in rule5(word):
        for word3 in rule4(word2):
            for word4 in rule2_3(word3):
                if word4 != "":
                    ans.append(word4)
                else:
                    counter += 1
writeFile(ans)
print(f"Total tokens: {len(ans)} | white spaces: {counter}")