You are working for your government, which wants to create its own LLM rather than rely on external models from other organisations or countries. The first step they have tasked you with is creating a simple tokeniser, which will be needed both for input data during training and for queries that users submit to the LLM.
A tokeniser takes an input string and breaks it up into meaningful substrings that are easier to analyse and process than the entire input.
The rules are as follows - input should be split on:
Spaces - these should not be included in the final output
Prefixes and suffixes - the longest prefix/suffix should be matched first, e.g. "interpreted" would be "inter", "pret", "ed", rather than "in", "terpret", "ed"
Numbers - e.g. "T800" will be split as "T", "800"
Non-word characters - e.g. "!@#" will be split as "!", "@", "#"
Note: the suffix and prefix matches should be case-insensitive
You can also assume all characters will be valid standard ASCII characters, e.g. the English alphabet, numbers, spaces and common punctuation.
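The space, number and non-word character rules above can be sketched with a single regular expression. This is only a minimal sketch, not the challenge's reference solution: the name `split_basic` is hypothetical, and treating an underscore as a single-character token is an assumption, since the rules do not mention it. Prefix and suffix handling is deliberately left out here.

```python
import re

# A piece is a run of letters, a run of digits, or one other non-space character.
# Spaces match none of the alternatives, so they are dropped from the result.
# Assumption: an underscore counts as a single-character token.
_PIECE = re.compile(r"[A-Za-z]+|\d+|[^A-Za-z\d\s]")


def split_basic(text):
    """Split text on spaces, numbers and non-word characters (no affix rules)."""
    return _PIECE.findall(text)


if __name__ == "__main__":
    for token in split_basic("I'll be back - T800"):
        print(token)
```

Using `findall` with alternatives avoids a separate pass to filter out empty strings, which `re.split` would otherwise produce between adjacent punctuation characters.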
Some full input strings and an array representing their tokenisation result can be seen below:
Note: the test cases are output as arrays for brevity - for the real output, each token should be on its own line and not inside quotes
interpreted
["inter", "pret", "ed"]
anti-gravity
["anti", "-", "grav", "ity"]
I'll be back - T800
["I", "'", "ll", "be", "back", "-", "T", "800"]
All those moments will be lost in time, like tears in rain.
["All", "those", "moments", "will", "be", "lost", "in", "time", ",", "like", "tears", "in", "rain", "."]
I'm sorry, Dave, I'm afraid I can't do that.
["I", "'", "m", "sorry", ",", "Dave", ",", "I", "'", "m", "afraid", "I", "can", "'", "t", "do", "that", "."]
The three laws of robotics: 1. A robot may not injure a human being or, through inaction, allow a human being to come to harm. 2. A robot must obey the orders given it by human beings, except where such orders would conflict with the First Law. 3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.
["The", "three", "laws", "of", "robotics", ":", "1", ".", "A", "robot", "may", "not", "in", "jure", "a", "human", "be", "ing", "or", ",", "through", "in", "act", "ion", ",", "allow", "a", "human", "be", "ing", "to", "co", "me", "to", "harm", ".", "2", ".", "A", "robot", "must", "obey", "the", "orders", "given", "it", "by", "human", "beings", ",", "ex", "cept", "where", "such", "orders", "would", "co", "nflict", "with", "the", "First", "Law", ".", "3", ".", "A", "robot", "must", "protect", "its", "own", "ex", "istence", "as", "long", "as", "such", "protect", "ion", "does", "not", "co", "nflict", "with", "the", "First", "or", "Second", "Laws", "."]
Any sufficiently advanced technology is indistinguishable from magic.
["Any", "sufficient", "ly", "advanc", "ed", "technology", "is", "in", "distinguish", "able", "from", "mag", "ic", "."]
The Matrix is everywhere. It is all around you.
["The", "Matrix", "is", "everywhere", ".", "It", "is", "all", "around", "you", "."]
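One way to use the sample cases above is a small checking harness that runs a candidate tokeniser against them and formats tokens the way the real output requires. This is a sketch under assumptions: `SAMPLE_CASES`, `failing_cases` and `format_output` are hypothetical names, only three of the listed cases are included, and the tokeniser is passed in as a function so that no particular implementation is implied.

```python
# Three of the sample cases listed above, copied as input -> expected tokens.
SAMPLE_CASES = {
    "interpreted": ["inter", "pret", "ed"],
    "anti-gravity": ["anti", "-", "grav", "ity"],
    "I'll be back - T800": ["I", "'", "ll", "be", "back", "-", "T", "800"],
}


def failing_cases(tokenise, cases=SAMPLE_CASES):
    """Return (text, expected, actual) for every case the tokeniser gets wrong."""
    failures = []
    for text, expected in cases.items():
        actual = tokenise(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def format_output(tokens):
    """Render tokens as the real output: one per line, not inside quotes."""
    return "\n".join(tokens)


if __name__ == "__main__":
    # str.split is a deliberately naive stand-in tokeniser, so every case fails.
    for text, expected, actual in failing_cases(str.split):
        print(f"{text!r}: expected {expected}, got {actual}")
```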
The text you should tokenise is shown below:
Let me give you one piece of advice. Be honest. He knows more than you can imagine. At last. Welcome, Neo. As you no doubt have guessed I am Morpheus. It's an honor to meet you. No the honor is mine. Please, come. Sit. I imagine that right now you're feeling a bit like Alice tumbling down the rabbit hole? You could say that. I can see it in your eyes. You have the look of a man who accepts what he sees because he's expecting to wake up. Ironically, this is not far from the truth. Do you believe in fate, Neo? No. Why not? I don't like the idea that I'm not in control of my life. I know exactly what you mean. Let me tell you why you're here. You know something. What you know, you can't explain. But you feel it. You felt it your entire life: Something's wrong with the world. You don't know what, but it's there. Like a splinter in your mind driving you mad. It is this feeling that has brought you to me. Do you know what I'm talking about? The Matrix? Do you want to know what it is? The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work when you go to church when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. What truth? That you are a slave. Like everyone else, you were born into bondage born into a prison that you cannot smell or taste or touch. A prison for your mind. Unfortunately, no one can be told what the Matrix is. You have to see it for yourself. This is your last chance. After this, there is no turning back. You take the blue pill the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill you stay in Wonderland and I show you how deep the rabbit hole goes. Remember all I'm offering is the truth. Nothing more. Follow me. Apoc, are we on-line? Almost. Time is always against us. Please take a seat there. - You did all this? 
- Mm-hm. The pill you took is part of a trace program. It disrupts your carrier signals so we can pinpoint your location. What does that mean? It means buckle your seat belt, Dorothy because Kansas is going bye-bye. Did you ? Have you ever had a dream, Neo, that you were so sure was real? What if you were unable to wake from that dream? How would you know the difference between the dream world and the real world? This is the Construct. It's our loading program. We can load anything, from clothing to equipment weapons training simulations anything we need. Right now we're inside a computer program? Is it really so hard to believe? Your clothes are different. The plugs in your body are gone. Your hair has changed. Your appearance now is what we call 'residual self-image.' It is the mental projection of your digital self. This isn't real? What is 'real'? How do you define 'real'? If you're talking about what you can feel what you can smell, taste and see then 'real' is simply electrical signals interpreted by your brain. This is the world that you know. The world as it was at the end of the 20th century. It exists now only as part of a neural-interactive simulation that we call the Matrix. You've been living in a dream world, Neo. This is the world as it exists today. Welcome to 'the desert of the real.' We have only bits and pieces of information. But what we know for certain is that in the early 21st century all of mankind was united in celebration. We marveled at our own magnificence as we gave birth to AI. Al. You mean artificial intelligence. A singular consciousness that spawned an entire race of machines. We don't know who struck first, us or them. But we know that it was us that scorched the sky. They were dependent on solar power and it was believed that they would be unable to survive without an energy source as abundant as the sun. Throughout human history, we have been dependent on machines to survive. Fate, it seems, is not without a sense of irony. 
The human body generates more bioelectricity than a 120-voIt battery. And over 25,000 BTUs of body heat. Combined with a form of fusion the machines had found all the energy they would ever need. There are fields, Neo, endless fields where human beings are no longer born. We are grown. For the longest time, I wouldn't believe it. And then I saw the fields with my own eyes watched them liquefy the dead so they could be fed intravenously to the living. And standing there, facing the pure, horrifying precision I came to realize the obviousness of the truth. What is the Matrix?
Your program should tokenise the text above and output each token on a new line, NOT inside quotes.
Hints
Hints will be released at the start of each of the following days - e.g. the start of day 3 is 48 hours after the challenge starts
Day 2 - Splitting the string by spaces is probably a good place to start.
Day 3 - You might then want to split the tokens by non-word characters (punctuation) and numbers; the regex classes \W and \d might come in useful for this. Remember, these punctuation symbols and numbers should be retained.
Day 5 - You can check whether each split string starts with one of the prefixes. If it does, add that prefix to the token list and take the remaining substring without the prefix.
Day 6 - Then check whether the remaining substring ends with one of the suffixes. If it does not, add the entire substring as a token; if it does, add the substring without the suffix, then the suffix, to the token list.
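The steps in the hints can be sketched end to end as follows. The challenge's actual prefix and suffix lists do not appear in this text, so `PREFIXES` and `SUFFIXES` below are hypothetical lists inferred only from the sample test cases, and they would need replacing with the real ones. The function names are also assumptions, and this is an illustrative sketch rather than a reference solution.

```python
import re

# Hypothetical affix lists, inferred from the sample cases only;
# the real lists are not given in this text.
PREFIXES = ["inter", "anti", "in", "co", "ex"]
SUFFIXES = ["able", "ity", "ion", "ing", "ed", "ly", "ic"]

# Runs of letters, runs of digits, or one other non-space character.
_PIECE = re.compile(r"[A-Za-z]+|\d+|[^A-Za-z\d\s]")


def split_affixes(word):
    """Split one alphabetic word into [prefix], stem, [suffix]."""
    tokens = []
    # Longest prefix first; matching is case-insensitive, output keeps case.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.lower().startswith(prefix):
            tokens.append(word[:len(prefix)])
            word = word[len(prefix):]
            break
    # Longest suffix first, applied to whatever remains after the prefix.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.lower().endswith(suffix):
            stem = word[:-len(suffix)]
            if stem:  # skip an empty stem, e.g. when the word is just a suffix
                tokens.append(stem)
            tokens.append(word[-len(suffix):])
            return tokens
    if word:  # empty when the whole word was a prefix, e.g. "in"
        tokens.append(word)
    return tokens


def tokenise(text):
    """Tokenise text: basic split first, then affix rules on alphabetic pieces."""
    tokens = []
    for piece in _PIECE.findall(text):
        if piece.isalpha():
            tokens.extend(split_affixes(piece))
        else:
            tokens.append(piece)
    return tokens


if __name__ == "__main__":
    print("\n".join(tokenise("anti-gravity")))
```

Sorting each list by length in descending order is what makes "interpreted" match "inter" before "in". The empty-string checks matter for short words such as "in", which would otherwise produce an empty token after the prefix is removed.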