Making machines, not humans, do the work

Melody Lee
Published in ESTL Lab Notes
5 min read · Mar 30, 2022

--

Introduction

As mentioned in our last post, OneSchoolBus (OSB) aims to make school transportation safer and smarter, but to do that, we needed a better understanding of the school bus ecosystem — which meant we needed data.

We sat down to assess the state of the data available to schools, as well as the processes that schools and operators undertook to obtain this data.

What’s happening now

What we found out: once a year, around March, schools asked bus operators for a list of students, their monthly bus fees, and the capacity of their buses. Bus operators would fill in the provided template; some even did it with pen and paper.

After operators submitted the information, the admin manager (AM) of each school had to match the submitted student list against the school’s own student list. This was a tedious process that took significant time and effort, because the names supplied by operators often did not exactly match the students’ birth certificate names. Furthermore, AMs were not best placed to know which pupils were taking school buses, as operators liaised with parents directly.

Once the students were matched, the system had access to their residential addresses, which MOE used to approximate stop addresses for further analysis, such as the distance between the school and the stop.

It was clear to us that we needed to get more data, such as pick-up and drop-off timings and bus routes, for more comprehensive data analysis. At the same time, we needed to be mindful of bus operators’ workloads, especially given their tight profit margins; this meant not increasing their admin workloads unduly.

OSB needed to be as painless as possible for our main data producers, the bus operators, as well as school AMs, who already had plenty on their plates.

Why are names improperly added?

When parents register for school bus services, they might not give bus operators the official full names of their children. We looked at the data from the pilot schools we worked with and identified a few common reasons for student name mismatch:

  • Typos
  • Omitted or extra spaces or punctuation between name parts
  • Wrong order of name parts, eg Xiao Ming Tan instead of Tan Xiao Ming
  • Abbreviations, eg Mohd
  • Partial names, eg Mary Tan instead of Mary Tan Hui Ling. Names with aliases are especially prone to this issue.
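To make these categories concrete, here is a minimal Python sketch of what simple normalisation can and cannot fix. The helper names `normalise` and `order_insensitive_key` are hypothetical and are not taken from the OSB codebase. Cleaning up case, spacing, and punctuation and ignoring word order handles some mismatches, but typos, abbreviations, and partial names still slip through, which is why a similarity measure is needed.

```python
import re


def normalise(name: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse repeated whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def order_insensitive_key(name: str) -> tuple[str, ...]:
    """Sorted name parts, so the order in which parts were typed does not matter."""
    return tuple(sorted(normalise(name).split()))


# Word order and punctuation differences disappear after normalisation...
print(order_insensitive_key("Xiao Ming Tan") == order_insensitive_key("Tan, Xiao Ming"))  # True
# ...but a partial name still does not equal the full name.
print(order_insensitive_key("Mary Tan") == order_insensitive_key("Mary Tan Hui Ling"))  # False
```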

Do we even need student identities?

We first went back to the drawing board to figure out whether we needed exact student identities at all. If we collected stop addresses, as we intended to, we would not need each student’s residential address.

Ultimately, the case for matching students was stronger:

  • Operational needs. When we correctly identify a student, we have access to information such as their class. A possible case: if a bus driver can’t find a student during drop-off, the receptionist in the general office can quickly ask the student’s teacher about their status, with the class information on hand. When the student is promoted to the next level, we can automatically update the student’s class without more data entry by operators.
  • Potential for integration with other apps. For instance, Parents’ Gateway is another app by ESTL used for communication between teachers and parents. What if parents could use the app to check if their children are safely on the way home?

Challenges

In an ideal world, parents would register with the full official name of their children and bus operators would key them in without mistakes. This is not the world we live in.

Our job would be much easier if we could obtain the student’s Birth Certificate number, since we could use that for a unique match within the school’s student list.

Alternatively, we could have given bus operators an auto-complete drop-down list with all the school’s students, so that they could identify the correct student themselves.

Unfortunately, as bus operators are private organisations, they should not have access to BC numbers under the Personal Data Protection Act, nor to the names of students other than those they collect from their direct customers.

Letting the algorithm do the work

Our problem was a string similarity problem, constrained by the list of students in the school. We needed an algorithm that we could be confident would find the correct student in MOE’s database if the name added by the operator was similar enough to the actual student’s name.

We looked at 3 algorithms:

  1. Levenshtein: Number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another.
    Eg, ignoring letter case, it takes 1 edit to get from Tan Huiling to Tan Hui Ling, the 1 edit being the addition of a space.
  2. Jaccard: Number of tokens the 2 strings share, divided by the number of distinct tokens across both.
    Eg, Amelia Tan is 67% similar to Tan Yuxi Amelia (2 of the 3 distinct name parts are shared).
  3. Letter Pairs/Dice’s Coefficient: After breaking both strings down into pairs of adjacent letters, count the pairs they share, double that count, and divide by the total number of pairs in both strings.
    Eg, FRANCE: {FR, RA, AN, NC, CE}
    FRENCH: {FR, RE, EN, NC, CH}
    FR and NC are the common pairs, which yields a similarity of (2 × 2) / (5 + 5) = 4 / 10 (40%).
An illustration of Dice’s Coefficient. (Source)
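The three measures can be sketched in a few lines of Python. This is an illustrative sketch rather than the production code: the function names are hypothetical, all comparisons ignore letter case, and the Dice version builds letter pairs within each word (an assumption, since the pairing could also run across spaces).

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Shared name parts divided by distinct name parts across both strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def letter_pairs(name: str) -> list[str]:
    """Adjacent letter pairs, built word by word so spaces never appear in a pair."""
    return [w[i:i + 2] for w in name.upper().split() for i in range(len(w) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the number of shared letter pairs over the total number of pairs."""
    pa, pb = letter_pairs(a), letter_pairs(b)
    total = len(pa) + len(pb)
    if total == 0:
        return 0.0
    common = sum((Counter(pa) & Counter(pb)).values())  # multiset intersection
    return 2 * common / total


print(levenshtein("Tan Huiling", "Tan Hui Ling"))           # 1
print(round(jaccard("Amelia Tan", "Tan Yuxi Amelia"), 2))   # 0.67
print(dice("FRANCE", "FRENCH"))                             # 0.4
```

Because letter pairs are an unordered collection, Dice gives Xiao Ming Tan and Tan Xiao Ming a perfect score, and a missing space only removes or adds a single pair. Levenshtein, by contrast, charges for every character that has moved.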

We compared the performance of each algorithm for about 500 students added by one of our bus operators for 2021. Our aim was for no false positives, ie, that the algorithm would not match a wrong student to a name. Of course, the stricter the threshold was, the lower the rate of matches.
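One way to score such a comparison is sketched below, assuming each operator-entered name has already been labelled by hand with the correct student. The function name `match_rates` and the student IDs are hypothetical, and the rates are expressed as a share of all names entered, which is an assumption about how the figures in this post were computed.

```python
from typing import Optional, Sequence


def match_rates(predicted: Sequence[Optional[str]], actual: Sequence[str]) -> dict[str, float]:
    """Share of names matched correctly, matched wrongly, or left unmatched.

    predicted[i] is the student ID the algorithm chose (None if it declined
    to match); actual[i] is the hand-labelled correct student ID.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    total = len(actual)
    if total == 0:
        return {"true_positive": 0.0, "false_positive": 0.0, "unmatched": 0.0}
    true_pos = sum(p == a for p, a in zip(predicted, actual))
    false_pos = sum(p is not None and p != a for p, a in zip(predicted, actual))
    return {
        "true_positive": true_pos / total,
        "false_positive": false_pos / total,
        "unmatched": (total - true_pos - false_pos) / total,
    }


# Made-up example: two correct matches, one declined, one wrong.
print(match_rates(["S1", None, "S3", "S9"], ["S1", "S2", "S3", "S4"]))
# {'true_positive': 0.5, 'false_positive': 0.25, 'unmatched': 0.25}
```

Tightening a threshold moves names from the first two buckets into "unmatched", which is the trade-off described above.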

Here are our results:

Dice’s Coefficient far outperformed the other two algorithms with an 86% true positive rate. Furthermore, we discovered another lever for even better performance: if we sort the candidate matches by their Dice’s Coefficient similarity, and accept the top match only when it beats the second-best match by at least 10%, we could achieve a 98.8% true positive and 0% false positive rate even without implementing a minimum similarity threshold. For a bus operator or school with 500 students, that’s a difference of 50 students the algorithm can find without additional work.
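The margin rule can be sketched as follows. This is a hedged illustration, not the deployed code: `best_match` and `min_margin` are hypothetical names, the roster is made up, and the 10% gap is read as a difference of 0.10 between the two similarity scores (it could also be meant as a relative difference).

```python
from collections import Counter
from typing import Optional


def letter_pairs(name: str) -> list[str]:
    """Adjacent letter pairs within each word, ignoring case."""
    return [w[i:i + 2] for w in name.upper().split() for i in range(len(w) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the number of shared letter pairs over the total number of pairs."""
    pa, pb = letter_pairs(a), letter_pairs(b)
    total = len(pa) + len(pb)
    if total == 0:
        return 0.0
    return 2 * sum((Counter(pa) & Counter(pb)).values()) / total


def best_match(entered: str, roster: list[str], min_margin: float = 0.10) -> Optional[str]:
    """Return the top-scoring roster name only if it clearly beats the runner-up."""
    scored = sorted(((dice(entered, name), name) for name in roster), reverse=True)
    if not scored or scored[0][0] == 0.0:
        return None  # nothing in the roster resembles the entered name
    if len(scored) == 1:
        return scored[0][1]
    (top, name), (runner_up, _) = scored[0], scored[1]
    return name if top - runner_up >= min_margin else None


roster = ["Tan Xiao Ming", "Tan Xiao Mei", "Mary Tan Hui Ling", "Lim Wei Jie"]
print(best_match("Xiao Ming Tan", roster))  # Tan Xiao Ming
print(best_match("Mary Tan", roster))       # Mary Tan Hui Ling
print(best_match("Tan Xiao M", roster))     # None (two students are too close to call)
```

Comparing the top two candidates, rather than checking a fixed minimum score, lets a short partial name such as Mary Tan match as long as no other student is a close second.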

For the remaining names, we ask operators for the last 4 characters of the student’s birth certificate number to find a match. Among the 1300 students of the school, about 10% of partial NRIC/FINs are shared by more than 1 student, with a maximum of 3 students having the same partial NRIC/FIN. This means that we can afford to relax the algorithm’s matching criteria for these names, as it is extremely unlikely that two students with the same partial NRIC/FIN share a very similar name.
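A minimal sketch of this fallback is shown below, under stated assumptions: the `Student` record, the `id_suffix` field, the function `match_with_suffix`, and the relaxed threshold of 0.3 are all hypothetical, and the names and ID fragments are made up. The idea is to first narrow the roster to students who share the 4-character suffix, then pick the most similar name among those few candidates.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Student:
    name: str
    id_suffix: str  # last 4 characters of the ID number (hypothetical field)


def letter_pairs(name: str) -> list[str]:
    """Adjacent letter pairs within each word, ignoring case."""
    return [w[i:i + 2] for w in name.upper().split() for i in range(len(w) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the number of shared letter pairs over the total number of pairs."""
    pa, pb = letter_pairs(a), letter_pairs(b)
    total = len(pa) + len(pb)
    if total == 0:
        return 0.0
    return 2 * sum((Counter(pa) & Counter(pb)).values()) / total


def match_with_suffix(entered_name: str, entered_suffix: str,
                      roster: list[Student], min_score: float = 0.3) -> Optional[Student]:
    """Filter by ID suffix first, then accept the most similar name above a relaxed bar."""
    candidates = [s for s in roster if s.id_suffix.upper() == entered_suffix.upper()]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: dice(entered_name, s.name))
    return best if dice(entered_name, best.name) >= min_score else None


roster = [
    Student("Muhammad Irfan Bin Ahmad", "123A"),
    Student("Tan Xiao Ming", "123A"),  # same suffix, very different name
    Student("Lim Wei Jie", "456B"),
]
match = match_with_suffix("Irfan Ahmad", "123a", roster)
print(match.name if match else None)  # Muhammad Irfan Bin Ahmad
```

Since at most a handful of students share a suffix, the name comparison only has to tell those few apart, which is why a much lower similarity bar is tolerable here.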

The algorithm is currently in place for the 8 pilot schools on OSB, and the results bear this out: about 2% of names require additional input from operators, whether by adding the partial NRIC/FIN or editing the student’s name. Compare this to the previous process, in which only about 55% of names were exact matches and the school administrator had to match the remaining 45% by hand.

Designing for the real world

It is often tempting to expect ideal behaviour from users, but doing so causes great frustration when they have to go back and forth to conform to the system’s expectations.

We believe instead in designing a system that takes into account the messiness of the real world.

If you share this belief and want to find out more about what we do, drop us a mail at hello@estl.edu.sg!
