Build Week 5: Data Science Clandestine Op
Build Weeks are a part of the rhythm of the yearly calendar at Long-View, opening our schedule up to allow us to dive into special activities and challenges. Build Weeks help us grow intellectually, help us make connections between disciplines, give us a chance to break the “routine” of school life, and give us an opportunity to try new things.
Build Week #5 involved an undercover op led by senior CIA agents but executed by Long-View junior CIA agents. We first found out that our math teacher, Mr. Moore, who is (suspiciously) a very private person, actually had a prior career within the CIA. Because of his prior relationship and due to a particularly pressing problem his ex-colleagues were tackling, a CIA unit based in Austin reached out to Agent Ames (aka Mr. Moore) to engage the Long-View students in an effort to leverage their stellar computer science skills. The Austin-based unit knew that no one would suspect a school and thus they could accomplish more -- working in collaboration with our students -- than they could alone. Time was of the essence as the CIA had tracked a nefarious group named DART that was working to take control of the world's satellites and shut them down. This could prove disastrous to all nations. With this information, we launched into Build Week #5.
Once all of our junior agents at Long-View had been tagged for security clearance, we assembled into 7 units based on regions of the world (from Lynx Unit focused on Asia to Ranger Unit focused on Antarctica) and began listening in on a data source (a bug that was planted recently) with a goal of seeing whether the information coming in from the bug could offer enough clues to create a Most Wanted List to pass along to CIA Headquarters in Langley, VA.
The first data set the Long-View units received was relatively limited. The data set included 40 data points of client IP addresses, each with a requested server's geographic coordinates. Junior agents manually tracked each data point on regional maps (utilizing knowledge of coordinate planes, ordered pairs, and latitude/longitude). Junior agents also learned about the pandas library for Python to prepare for an influx of data.
The data set grew rapidly over the next few days, and the junior agents quickly realized that they would not be able to rely on manual methods to examine the data. They worked in Jupyter Notebooks and used Python functions from a plotting library to map approximately 40,000 servers. They learned about k-means clustering to locate centers of apparently normal activity. They compared results and engaged in discussions that touched on areas of data science as they explored patterns, clustering, averaging, centroids, and outliers. The junior agents made conjectures about the largest cluster in their regional data set and approximated the centroid. Then, they located outliers, the servers farthest from that centroid, using the Pythagorean theorem, i.e., the distance formula.
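For readers curious what the centroid-and-distance step might look like, here is a minimal sketch, not the notebooks the junior agents actually used. The column names (`server_id`, `lat`, `lon`) and the sample coordinates are invented for illustration, and the sketch treats latitude and longitude as flat x/y coordinates, which is a simplification.

```python
import pandas as pd

# Hypothetical sample data: the column names and coordinates are made up.
servers = pd.DataFrame({
    "server_id": ["s1", "s2", "s3", "s4", "s5"],
    "lat": [30.0, 31.0, 29.0, 30.0, 60.0],
    "lon": [-97.0, -98.0, -96.0, -97.0, -40.0],
})

def distance_from_centroid(df):
    """Return a copy of df sorted by distance from the mean coordinate."""
    # Approximate the centroid as the average latitude and longitude.
    centroid_lat = df["lat"].mean()
    centroid_lon = df["lon"].mean()
    out = df.copy()
    # Distance formula (Pythagorean theorem) applied to every row at once.
    out["distance"] = (
        (out["lat"] - centroid_lat) ** 2 + (out["lon"] - centroid_lon) ** 2
    ) ** 0.5
    # Farthest servers first: these are the outlier candidates.
    return out.sort_values("distance", ascending=False)

ranked = distance_from_centroid(servers)
print(ranked.head(1))
```

Sorting by distance puts the outlier candidates at the top of the table, so the same function works whether the data set holds five servers or 40,000.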
As the data sets grew to 2 million client IP addresses, the junior agents developed a more sophisticated understanding of outlier identification: the most suspicious activity was connected to servers farthest from any centroid. Their new understanding, as well as their coding skills, was put to the test as they worked to figure out:
Which servers were suspicious based on their geographic coordinates
Which client IP addresses were accessing the suspicious servers
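The two questions above could be approached roughly as follows. This is a sketch under stated assumptions rather than the agents' actual code: the k-means routine is a bare-bones version written with NumPy, the table and column names are hypothetical, the sample coordinates are invented, and the IP addresses come from ranges reserved for documentation.

```python
import numpy as np
import pandas as pd

def k_means(points, k, iterations=20):
    """Bare-bones k-means; returns the k centroids as an array."""
    # Deterministic start (an assumption): k points spread through the array.
    centroids = points[np.linspace(0, len(points) - 1, k, dtype=int)].copy()
    for _ in range(iterations):
        # distances[i, j] = distance from point i to centroid j
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def distance_to_nearest_centroid(points, centroids):
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.min(axis=1)

# Hypothetical data: two tight clusters of servers plus one far-away server.
servers = pd.DataFrame({
    "server_id": ["a1", "a2", "a3", "x1", "b1", "b2", "b3"],
    "lat": [30.0, 30.5, 29.5, -75.0, 48.0, 48.5, 47.5],
    "lon": [-97.0, -97.5, -96.5, 0.0, 2.0, 2.5, 1.5],
})
# Hypothetical log of which client requested which server.
requests = pd.DataFrame({
    "client_ip": ["192.0.2.10", "192.0.2.11", "198.51.100.7", "192.0.2.10", "203.0.113.5"],
    "server_id": ["a1", "b2", "x1", "x1", "a3"],
})

coords = servers[["lat", "lon"]].to_numpy(dtype=float)
centroids = k_means(coords, k=2)
servers["distance"] = distance_to_nearest_centroid(coords, centroids)

# Question 1: which servers are farthest from any centroid?
suspicious_servers = servers.nlargest(1, "distance")
# Question 2: which client IPs requested those servers?
suspicious_clients = requests.merge(
    suspicious_servers[["server_id"]], on="server_id"
)["client_ip"].unique()
print(suspicious_servers)
print(suspicious_clients)
```

Measuring each server against its nearest centroid, rather than one overall average, keeps a server in a second busy region from being flagged just because it is far from the first.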
Coding work in Python traversed these areas:
using variables, lists, and data frames
defining and calling functions
calculating distance
using the pandas library
data science topics, including data manipulation of rows/columns, calculating mean, outlier detection, and k-means clustering
The junior agents compiled lists of their unit's most suspicious IP addresses and shared them with the other regional units. We then worked as a global team to identify patterns in the client IP addresses. One of the senior agents worked quickly to identify the locations from which the client IPs in those patterns originated. One was the National History Museum in Bulgaria (perhaps a good place to locate a secret operation), and another was the DART Container Company (another great potential place to hide). There were several others as well: The Tojikiston Hotel, a remote location in Antarctica, and a company called “DRT Strategies.” The final IP address took us all by surprise... it was our own building! We had a double agent on our hands!
A last bit of analysis, based on how often each IP appeared across the lists generated by all units and the number of continents in which it appeared, led us to collectively identify the most suspicious IP address. Thus, we sent the senior agents off to Tajikistan to do some covert work at The Tojikiston Hotel (location of the IP address 109.74.75.56), as it was the most likely headquarters of DART. Build Week concluded on a high note, and we are certain the world is safer now thanks to the thoughtful work of our agents/computer analysts.
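For the curious, that final tally of how often an IP appeared and on how many continents might be sketched in pandas like this. Everything in the sample table is hypothetical: "Unit C" is a placeholder name, the IP addresses come from ranges reserved for documentation, and ranking by continents first and appearances second is an assumption about how the two counts were combined.

```python
import pandas as pd

# Hypothetical combined table of every unit's suspicious-IP list.
unit_lists = pd.DataFrame({
    "unit": ["Lynx", "Lynx", "Ranger", "Ranger", "Unit C", "Unit C"],
    "continent": ["Asia", "Asia", "Antarctica", "Antarctica", "Europe", "Europe"],
    "client_ip": [
        "192.0.2.10", "198.51.100.7",
        "192.0.2.10", "203.0.113.5",
        "192.0.2.10", "198.51.100.7",
    ],
})

def rank_ips(df):
    """Count, for each IP, its appearances and its distinct continents."""
    summary = df.groupby("client_ip").agg(
        appearances=("unit", "count"),
        continents=("continent", "nunique"),
    )
    # Assumed ordering: most continents first, then most appearances.
    return summary.sort_values(["continents", "appearances"], ascending=False)

ranking = rank_ips(unit_lists)
print(ranking)
```

An IP that shows up on several units' lists and across several continents rises to the top of the ranking, which mirrors the reasoning described above.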