This unit could be taught within a statistics unit at all curriculum levels to support students' understanding of why particular searching methods are faster than others.
Searching for a keyword, a value, or a specific piece of data (information) is the basis of many computing applications, whether it’s looking up a bank account balance, using an internet search engine, or searching for a file on your laptop. Computers deal with a lot of information so we need efficient algorithms for searching. This unit explores some common algorithms that are used to search for data on computers, with the opportunity to integrate this learning with statistics.
Because computers are often required to find information in collections of data that can be very, very large, using the right algorithm for searching is crucial! A key idea in computer science that we'll be illustrating with searching algorithms is just how fast an algorithm can be; students might think that if you're searching twice as much data then it should take twice as long, but we'll look at a way to search that takes almost the same amount of time to search 2 million items as it would for 1 million.
We'll also look at the kind of steps that algorithms use to solve the important problem of searching data. When you’re telling a computer how to find exactly what you’re looking for you need to remember that computers are simple machines, they can only look at one piece of data at a time, and check if it is what you’re searching for. Imagine if every time a computer had to search for something, it had to compare every single piece of information it contained to the information it was searching for before it found what you were looking for? It would take a very long time to find what you wanted. That’s why computer scientists need to develop quick and efficient ways of doing this.
For computer scientists finding the right algorithm requires analysing the time taken and the amount of memory used. Some great discussion questions are:
Is it always useful to sort your data so you can complete a binary search? Let’s look at a word processor, you have 100 pages of text and want to find one word. Why wouldn’t the word processor sort your words into order to find the word you are looking for?
A consideration that computer scientists take into account is how many times are you likely to search for something. If it’s only once, then it isn’t worth the effort to sort and then find the item.
This unit demonstrates two different search methods: sequential (sometimes called linear) search, and binary search. We'll see that binary search is remarkably fast, and although there are other search algorithms that are can do even better (such as the hash table, which is covered in the unit on Data Structures for search algorithms), the step-up from sequential search to binary search demonstrates how much there is to gain there is to be made by applying the right algorithm to the job.
People expect computers to find information very quickly, and the speed of search algorithms has a big influence on our experience as users. A search engine typically searches billions of web pages in less than a second, which is the kind of time-frame that humans are used to working in. We don’t like seeing the “wheel of doom” appear while the computer is processing/thinking about what to do, and if an app is too slow then people won’t want to use it.
This unit focuses on understanding two simple searching algorithms, and comparing how long they can take to reliably find the correct data (a number or word, for example).
Data: The way information is stored. Numbers are a type of data you will come across very often (e.g. account numbers, customer numbers, card numbers, serial numbers, product numbers and so on). It could also be text (such as words that we're searching for), dates, and even images and sound. You can think of anything stored in a computer's memory as data; it’s all pieces of information.
Raw data: Raw data means ‘unprocessed data’ which has been collected from a source, and has not been cleaned up or used yet. For example, say you had measured the temperature every day of the year, and now you have a bunch of numbers, this could be called your ‘unprocessed data’. If you then divided this data into each month of the year, and averaged it so you had the average temperature for each month, this would be your ‘processed data’.
Sequential search: A sequential search is when you look at each piece of data, one by one, and don’t stop until you find what you are looking for. You can use a sequential search on any data, whether it is sorted or unsorted (though it would be potentially a slow way to locate what you are looking for if the data is in sorted order!) However, sequential search is the only option you can use when you need to search through data that is unsorted. Say for example you are looking for the word “is” in the following list: “the”, “my”, "it", “is”, “said”, “a”, “from”. A computer would go to the first word “the” and ask is it the same as “is”. If it is it will stop searching, if it isn’t, it will go to the next word and repeat this process until it finds the matching data. In this case it takes 5 guesses to find the word. It could have taken up to 7 guesses or just 1 guess, depending on which word you were searching for.
Imagine our detective (below) is programmed to do a sequential search to find the word "said". Here’s the process or algorithm he would follow to find it. Notice he can only compare one word at a time!
Now imagine if he has a million words to search!
Binary search: An algorithm that tells us how to efficiently find a specific value in an ordered (sorted) list. It is called ‘binary’ search because each time you look at a value in the list you divide the list into 2 parts, one is discarded and the other is kept. The word "binary" here just means something that has two parts, such as a binary star system (made of two stars); binary search shouldn't be confused with binary numbers.
Suppose our detective is looking for the word "said" from the following list, which is in alphabetical order: “a”, “from”, “is”, “it”, “my”, “said”, “the”. First the computer will go straight to the middle word, “it” and see if that matches - because it doesn’t, the computer doesn’t need to check the 3 words to the left. Now the computer finds the middle word of the right half of the list, which is “said”. Binary search narrows in on the location of the word very quickly.
We can apply these search strategies ourselves when we are looking to find one thing. Every time we play games like “guess my number” we can apply the binary search to it. Likewise if we are searching for a book in the library that hasn’t been returned to the correct place, we’ll need to do a sequential search to find it. Searching is everywhere in our lives, from finding a person’s address to looking up a phone number in a phone book. We can apply binary or sequential search methods depending on how the data has been organised to start with.
This unit could be taught within a statistics unit at all curriculum levels to support students' understanding of why particular searching methods are faster than others.
What was most surprising about the learning that happened from the teaching of this unit?
Who were the students who were very systematic in their activities?
Who were the students who were very detailed in their activities?
What would I change in my delivery of this unit?
Throughout the lessons there are links to computational thinking. Below we've noted some general links that apply to this content.
Teaching computational thinking through CSUnplugged activities supports students to learn how to describe a problem, identify what are the important details they need to solve this problem, break it down into small logical steps so that they can then create a process which solves the problem, and then evaluate this process. These skills are transferable to any other curriculum area, but are particularly relevant to developing digital systems and solving problems using the capabilities of computers.
These Computational Thinking concepts are all connected to each other and support each other, but it’s important to note that not all aspects of Computational Thinking happen in every unit or lesson. We’ve highlighted the important connections for you to observe your students in action. For more background information on what our definition of Computational Thinking is see our notes about computational thinking.
Searching through data is something computers do all the time, just think about how often we need to search for words in documents and information on the internet! Searching efficiently is a very important problem and that computer scientists have worked on for a long time. To solve this problem we need algorithms, processes that can be followed each time we search for something, and many different ones have been developed (such as binary search). Each of these algorithms can be seen as a solution to a problem.
Students will be learning, using, and describing different algorithms throughout these lessons. They will need to follow clear instructions when doing the activities and when they are doing this they are carrying out algorithms.
The two algorithms we are looking at in this unit, sequential search and binary search, take different amounts of time to complete a search, and sequential search is less efficient than binary search.
The amount of time sequential search takes isn’t very predictable and can vary wildly, from finding something straight away, all the way to having to check every single item we have! On average it will look at half the items, which means the average time it takes is proportional to the number of items being searched; this is why it is also known as linear search, because the time needed is linear in the amount of data being searched. Binary search on the other hand is more predictable and is guaranteed to work within a small number of steps. Since it eliminates half the items we are searching every time we look at an item it works very fast, even with HUGE amounts of data! To search one billion items we would have to look at 30 of them at the very most!
If we have an abstract idea of how the searching algorithms we can use work, then we can apply them to a whole range of situations!
When searching through data it is important to only focus on the information we need to find what we are looking for. When considering general algorithms for searching, one thing we can always abstract away is the nature of the data we are searching through. When we use a searching algorithm we don’t need to know what the data represents, we just need to know what we are searching for, and if we are searching through an ordered list we also need to know how that data is ordered.
For example say we are looking for the number 160 in an unordered list; we don’t need to know whether these numbers represent heights, dollars, or bank account balances, we just need to compare the number we are searching for to the numbers in the list. Now say we are searching for the word “Max” in a list of words which is sorted into alphabetical order; we don’t need to know if we are looking at a list of pets names, or a list of random 3 letter words; we only need to know what alphabetical order means and how we use this to determine if a word is earlier or later in the alphabet. Then the same rules for searching numbers can be applied to words.
The most basic step in any searching algorithm is comparing two pieces of data and determining if they are the same or not. Each of the algorithms students will be exploring was created by decomposing the problem of searching down to make use of this very basic step. One of the most important examples of decomposition will be covered in lessons 2 and 4, the Divide and Conquer method. This method is used when we perform a binary search, which is a classic method of decomposition - you divide the list of values into two parts, and then work on each half. In this case, one of the halves will be trivial to "work on" because we know it can't contain what we're searching for, and so we don’t have to do anything with it! Breaking the problem in half repeatedly soon decomposes it to a very simple case: a list of one item, which is very easy to search!
The divide and conquer strategy is a pattern that appears frequently in computer science, and also in real life! It is an efficient and logical way of attacking many different problems where you are searching for something in a group of objects that have different identifying features. For example, if you were trying to guess your friends favourite food you could ask questions like “Is it bananas?”, “Is it chocolate?”, but this would be equivalent to doing a sequential search because with each question you can only eliminate one type of food from all the possible types of food it could be! Instead you could apply divide an conquer and ask questions like “Is it a vegetable?”, “Is it a savoury food?”. These questions eliminate many possible foods just with one question.
When doing these lessons look to see if students recognise patterns between the activities. Do they notice that searching through the unsorted number boxes is actually the same task as guessing which cup a number is in when the numbers are in a random order? Do they notice that searching the ordered number boxes is actually exactly the same as the divide and conquer activity? If they do then they might also realise that they can use the same algorithms in each of these similar activities! If we generalise our algorithms (by abstracting away unnecessary information) then we can reuse them in new situations and apply them to new, but similar, problems. They might also start recognising similar problems to these in their everyday lives and applying these strategies, because searching for things is something we all do a lot, whether we’re looking for a book on a bookshelf, or searching your wardrobe for that shirt you want to wear!
Evaluation is a key part of these lessons. There can be many different algorithms for the same problem so it is very important that we evaluate these algorithms and figure out which is the best to use in each situation. In these lessons students will look at two different algorithms for searching, each of which performs differently. For each of the search algorithms used the class will be collecting data on how long each method takes, trying to make sense of that data (for example through looking at it all on a graph), and evaluating each of the algorithms based on the data they have collected. This is one of the key points where Computational Thinking and statistics are linked, when we need to evaluate results.
By understanding the algorithms, evaluating the data collected, and thinking about how each of the different algorithms work and the differences between them, students will be exercising their logic skills a lot! Logical problem solving skills are exercised when they think about the following: