What Does Data Smell Like? The Anatomy of Internet Traffic

Max Finn
12 min read · Oct 12, 2020

Photo by okeykat on Unsplash. If you really could smell data, dogs would be pros at cyber security!

If you’re reading this, you’re probably an avid user of computers and the internet (this article assumes you have some knowledge of how computers work, know Python, and are not faint of heart). As a CS student, computers are all I know, and a connection to the internet is essential for what I do. For many, it supplements their free time and, I daresay, replaces certain social interactions that typically shouldn’t be overlooked. After the epiphany that I was like this even before the pandemic, I realized another fault: I was an ignorant consumer. I didn’t know how data was being shipped over the network to and from my devices. For someone as reliant on computers and the internet as I am, I should really be more aware of where my 3 A.M. YouTube searches go and who could gain access to my search history. This brought panic, as I’m sure it has for others. Who else could know about my weird obsession with seeing pimples professionally popped? These urgent questions needed to be answered, so in true computer science student fashion, I said ‘maybe tomorrow’ and went to sleep.

New days bring new questions. I had a basic understanding of how packets can take multiple different routes to get to their destination, providing strong fault tolerance. This is called redundancy: overlapping purpose, so that if one route fails, there are others to pick up the slack. However, I had no idea what those packets were or how they were organized. The first step was to get familiar with the territory, and what better way to learn than to hack it?

So I started making a packet sniffer, a program that would allow me to eavesdrop on data traveling across a network I had access to. Through this project, I would start learning the fundamentals of data transfer over networks and how they could potentially be compromised. Before any of that though, I’ll explain what sniffing is and why I chose it for my project so you can decide if it’s right for you to learn.

Packet Sniffing

Packet sniffing is the act of reading data packets as they flow across the network. You find these packets by taking a Network Interface Controller (NIC), the hardware that connects a machine to the network, and running it in promiscuous mode, which means it passes every frame it sees up to the CPU instead of only the frames addressed to it, like usual. This creates a single point where we can read all of the traffic that reaches that interface. From here we decode that data into its components to get information like the content, destination, and source.

This project is only possible if we understand exactly how the data is being transferred, and by nature it teaches us how to identify the symptoms of a sniffing attack. This is why I chose this project: it is an excellent starting point for studying this domain, and a tangible example of how we as programmers can control more of the computers and networks we use. Not only is this knowledge necessary for defending against sniffing, but there are benefits to sniffing your own network, like detecting bottlenecks and bandwidth hogs. Finally, this even comes with a disclaimer: this requires Linux because of the packages we use, and free professional-level programs exist for this already. Wireshark and NetworkMiner include a sniffer as well as a slew of other useful network testing tools, so this article is a project recap purely for understanding, and you can continue to just read along if you’d like.

Your Network and You:

Everyone always asks “Where is wifi?” and “What happened to the wifi?”, but no one asks “How is wifi?”. We need to understand the components of a data packet and how it gets from source to destination in order to intercept it. These are decided by protocols: ‘rules’ put in place so that data is uniformly formatted and can be properly read on arrival. The one we’ll be dissecting is the most commonly used on the web: the TCP/IP protocol suite.

The Transmission Control Protocol/Internet Protocol is how we describe the four layers that make up the standard of transferring data between computers. They are:

  • Data Link layer: Controls the physical aspect of data transmission, whether it be ethernet, wireless routing, NICs and more.
  • Internet layer: Controls how and where packets are sent on the network.
  • Transport layer: Manages the conversation between the two endpoints. With TCP this starts with a handshake and confirms that each side is receiving the other’s packets.
  • Application layer: This is what handles user interactions, like with email and messaging. The other layers deal with the details of the communication, so this is the only layer the user deals with directly.

First, we find where the data is meant to go. This is done by reading the IP address, which can be compared to a physical street address, just for the internet. Your computer, phone and all other devices have their own IP addresses, which is how the network knows which device has which connection so it can send each one the right data. TCP encompasses everything after that, and is the equivalent of the transportation that gets you to that address. They are separate protocols, but because one is basically useless without the other, the “TCP/IP Model” is the recognized terminology. The data is divided into packets and sent through each layer before being dropped at the designated IP address. If our program is at work, the data will have made a ‘pit stop’ at our sniffer so that it could be read and printed to us in human-readable form.

We’re finally ready to start. We still have to learn the anatomy of the packets, but we can do that along the way. Be warned, this is where we get technical because we have to fit this in one article. If you’re not familiar with this at all, it will take time, but you’re not alone! I went into this blind and will provide everything I used to understand what was going on, but you’ll have some homework to do. This might look intimidating really quickly, but I’ve done my best to make it digestible, and after at least one read-through you should have enough information to sound cool at parties.

https://www.thegeekstuff.com/2012/03/ip-protocol-header/

Remember those layers from earlier? This is the data they receive at each layer. From here we can focus on what we want first: the ethernet frame in the data link layer. This will get us the data itself, the source and destination MAC addresses (so we know which of the computers on the network are making the request), and the protocol type so we know how to read it. So, let’s look at the frame:

https://www.ionos.com/digitalguide/server/know-how/ethernet-frame/

Again, this may look intimidating, but for now we only need to look for what we need and not worry about what everything else is doing (the link under the picture has a complete rundown). Lucky for us, that’s neatly contained in the ethernet frame and easy to get. For this tutorial, I’ll be using Python’s socket module to capture the frame, along with struct to unpack it and textwrap to format the output.
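A minimal sketch of that unpacking step, assuming the first 14 bytes of the frame are the destination MAC, the source MAC, and the two-byte type, in that order. The name ethernet_frame() comes from this article; get_mac_addr() is a hypothetical name for the MAC-formatting helper, and textwrap is left out because it only matters for printing.

```python
import socket
import struct


def get_mac_addr(raw_bytes):
    """Format 6 raw bytes as a MAC address like AA:BB:CC:DD:EE:FF."""
    return ":".join(f"{byte:02X}" for byte in raw_bytes)


def ethernet_frame(data):
    """Split a raw ethernet frame into its header fields and payload."""
    # '!' means network (big-endian) byte order.
    # '6s 6s H' means two 6-byte MAC addresses followed by a 2-byte type.
    dest_mac, src_mac, proto = struct.unpack("! 6s 6s H", data[:14])
    # htons() swaps the two bytes of the type field on little-endian machines
    # and leaves it untouched on big-endian ones (see the endianness note).
    # Everything after the first 14 bytes is the payload.
    return get_mac_addr(dest_mac), get_mac_addr(src_mac), socket.htons(proto), data[14:]
```

Calling ethernet_frame() on a captured frame should hand back two readable MAC addresses, the type, and the remaining bytes to decode next.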

If you notice, at the end of the return statement we get all of the bytes after the first 14. This is the payload, or the data that we will decode based on its type. If you were to make a search on a website, say Reddit, their site would be the destination and your computer would be the source of the search. The payload contains the query made to their site, the actual words you typed in. If we’re receiving packets from everyone, this will show us which computer is going to which sites, and what they’re doing there. Pretty creepy, right? We’ll get into this more in the defense section, so you know more about when you’re vulnerable.

Just a few more things to know before we move on. In ethernet_frame(), in the comment above the return statement, I mention that the socket function htons() is converting the data based on endianness. This is simply the order in which a computer system stores the bytes of a number, either Big or Little Endian, which you can learn more about here. We’ve also added a simple helper function to organize the MAC addresses into a human-readable format (the standard six groups of two hexadecimal digits).
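To make the endianness point concrete, here is a small sketch that reads the same two bytes both ways. The bytes 08 00 are the IPv4 type as it appears on the wire; the variable names are just for illustration.

```python
# The two-byte IPv4 type field exactly as it travels over the network.
type_field = bytes([0x08, 0x00])

# Big-endian: the first byte is the most significant one.
as_big_endian = int.from_bytes(type_field, "big")
# Little-endian: the first byte is the least significant one.
as_little_endian = int.from_bytes(type_field, "little")

print(hex(as_big_endian))   # 0x800
print(as_little_endian)     # 8
```

The same pair of bytes is 0x0800 read one way and 8 read the other, which is why the byte order has to be agreed on before the number means anything.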

Next, we’ll actually capture the traffic coming across the network to feed our program. We open a raw socket, which allows us to listen to everything arriving at a network interface. This is why you’re not supposed to do anything personal on public Wi-Fi, because those networks allow anyone to join and listen to the traffic. After the connection is established, we listen, then process the data as it floats across the network and format it as we print so our human eyes can read it:
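A rough sketch of the capture loop, assuming Linux and root privileges. AF_PACKET, SOCK_RAW and ntohs() are real parts of the socket module; the function names open_sniffer_socket() and sniff(), and the handle_frame callback, are hypothetical names chosen for this example.

```python
import socket

ETH_P_ALL = 3  # Linux constant meaning "give me frames of every protocol"


def open_sniffer_socket():
    """Open a raw socket that receives whole ethernet frames (Linux, root only)."""
    return socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(ETH_P_ALL))


def sniff(conn, handle_frame, count=10):
    """Read `count` frames from `conn` and pass each one to `handle_frame`."""
    for _ in range(count):
        # 65535 is the buffer size we ask for; recvfrom also returns
        # address info about the interface, which we ignore here.
        raw_data, _addr = conn.recvfrom(65535)
        handle_frame(raw_data)
```

With root on Linux, something like sniff(open_sniffer_socket(), print) would dump the next ten raw frames. Keeping the socket separate from the loop is a design choice that makes the loop easy to try out with a fake connection object first.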

We’ve gotten the data we need; it’s up to us now to break it down and read it properly.

Ever seen the movie Inception? We’re now entering dream-within-a-dream territory, except it’s data-within-data. The data itself… is a packet! So we’ve opened one box to find another, and we might even have to go deeper. But for now, we have enough variety to keep us busy, because this isn’t just any old ethernet frame: we’ve gotten to the IPv4 packet. This is the format typical internet traffic is sent in, and as you’ve probably guessed it contains a bit more than just the data itself. Our target is the header, which follows suit with everything else and branches off in a truly unreasonable number of directions:

https://www.thegeekstuff.com/2012/03/ip-protocol-header/

Don’t panic. Again, we can cherry-pick what we want and leave the rest, and the link under the picture has extra homework if you’re going for a gold star. For now, just understand that it is an array of bytes and you’ll be fine. We need six things from this monster:

  • Version: IPv4 most commonly, but it could be IPv6. Other versions are not used. The version is also signaled by the two-byte ‘type’ field from the ethernet frame earlier.
  • Header Length: Used to determine at which byte the data starts.
  • Time To Live(TTL): The effective lifetime of the packet.
  • Protocol: The transport layer protocol.
  • Source: Where did he come from
  • Target(destination): Where does he go

Now how do we break these down, Cotton-Eye Joe?

Sockets got you covered. We just need to break the header down so everything goes into its proper variable. The first byte packs two values together: the version in the upper four bits and the header length in the lower four. We bit shift that byte right by four to get the version by itself. The next part is a bit tricky, and requires an explanation. The header length field only has four bits, which isn’t enough room to store the header’s length in bytes directly, so it is stored as a count of 4-byte words instead. With the bitwise operator & and a mask of 15, we keep only those lower four bits, then multiply by four to get the length in bytes. As for multiplying it by four, I found this great quote from a book on the subject:

“The size of the Header Length or the IHL field is 4 bits. The Header Length field is used to specify the length of header, which can range from 20 to 60 bytes. You must multiply the value in this field by four to get the length of the IP header. For example, if the value in this field is 3, the length of the header is 3*4, which is 12 bytes.”
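Here is a minimal sketch of pulling those six values out of the header, assuming data starts at the first byte of the IPv4 packet (the payload returned by the ethernet step). The names ipv4_packet() and format_ipv4() are hypothetical.

```python
import struct


def format_ipv4(raw_bytes):
    """Format 4 raw bytes as a dotted address like 192.168.0.1."""
    return ".".join(str(byte) for byte in raw_bytes)


def ipv4_packet(data):
    """Extract the fields we care about from an IPv4 header."""
    version_header_length = data[0]
    version = version_header_length >> 4                # upper four bits
    header_length = (version_header_length & 15) * 4    # lower four bits, in 4-byte words
    # Skip the first 8 bytes, read TTL and protocol (1 byte each),
    # skip the 2-byte checksum, then read the two 4-byte addresses.
    ttl, proto, src, target = struct.unpack("! 8x B B 2x 4s 4s", data[:20])
    # The payload starts wherever the header says it ends.
    return version, header_length, ttl, proto, format_ipv4(src), format_ipv4(target), data[header_length:]
```

For a typical header the first byte is 0x45, which this splits into version 4 and a header length of 5 * 4 = 20 bytes.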

We have the destination, the source, the data and type. Depending on the data type we’re dealing with, we’ll have to handle this differently. After that, it’s just a matter of formatting and displaying the data. We’re almost there! I’ll quickly go through the types and how to handle them:

TCP/HTTP:

http://mars.netanya.ac.il/~unesco/cdrom/booklet/HTML/NETWORKING/node040.html

There’s a lot going on, but once again, we only need some of it: the usual source and destination (ports this time), the sequence number, the acknowledgement number, and the flags. Getting the first few is as easy as you’d expect, and we can use bit shifting and masking to pull out the group of flags, each known by a three-letter name (URG, ACK, PSH, RST, SYN, FIN).
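A sketch of that unpacking, assuming data starts at the first byte of the TCP segment. The name tcp_segment() and the FLAG_BITS table are hypothetical; the bit positions are the standard ones for the six classic TCP flags.

```python
import struct

# Bit value of each flag inside the 16-bit offset/reserved/flags field.
FLAG_BITS = {"URG": 32, "ACK": 16, "PSH": 8, "RST": 4, "SYN": 2, "FIN": 1}


def tcp_segment(data):
    """Extract ports, sequence numbers, flags and payload from a TCP segment."""
    src_port, dest_port, sequence, acknowledgement, offset_reserved_flags = struct.unpack(
        "! H H L L H", data[:14]
    )
    # The top four bits hold the header length in 4-byte words.
    offset = (offset_reserved_flags >> 12) * 4
    # Mask out each flag bit in turn.
    flags = {name: bool(offset_reserved_flags & bit) for name, bit in FLAG_BITS.items()}
    return src_port, dest_port, sequence, acknowledgement, flags, data[offset:]
```

If either port is 80, the payload may be plain HTTP text, which is the extra case the main function has to watch for.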

UDP:

Finally, an easy one (I think you guys can guess what’s going on here at this point):
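A minimal sketch, assuming the standard 8-byte UDP header of source port, destination port, length and checksum. The name udp_segment() is hypothetical, and the checksum is skipped because we don’t use it.

```python
import struct


def udp_segment(data):
    """Extract ports, length and payload from a UDP datagram."""
    # Three 2-byte fields, then skip the 2-byte checksum.
    src_port, dest_port, length = struct.unpack("! H H H 2x", data[:8])
    return src_port, dest_port, length, data[8:]
```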

ICMP:

The last one! Unfortunately, another anticlimactically easy one:
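A minimal sketch, assuming the first four bytes of an ICMP message are the type, the code and a two-byte checksum. The name icmp_packet() is hypothetical.

```python
import struct


def icmp_packet(data):
    """Extract type, code, checksum and the remaining bytes from an ICMP message."""
    icmp_type, code, checksum = struct.unpack("! B B H", data[:4])
    return icmp_type, code, checksum, data[4:]
```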

We’re over the hump. After finishing the main function and some optional formatting setup, you’re finally ready to sniff some data.

Here’s the breakdown for the main function (which you can find in my repo, along with the formatting):

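  1. We unpack the ethernet frame, and check the ethernet protocol to make sure it returns eight, signifying IPv4, as all others aren’t useful to us. (IPv4’s type is 0x0800 on the wire; it reads as eight after htons() swaps the bytes on a little-endian machine.)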
  2. Based on the protocol type from the IPv4 packet, we process this data. TCP can include extra HTTP data, so we watch out for that.
  3. We print the data in formatted form, in order to understand what’s going on a bit easier.
  4. We’re finally done! Time to start thinking about the next project…
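The decision-making in those steps can be sketched as a small dispatch function. This is not the main function from the repo, just an illustration: classify() and the constant names are hypothetical, while the protocol numbers 1, 6 and 17 are the standard IP protocol numbers for ICMP, TCP and UDP. Treating port 80 as HTTP is an assumption that only holds for unencrypted web traffic.

```python
import socket

# IPv4's type after htons(); this is 8 on little-endian machines.
ETH_PROTO_IPV4 = socket.htons(0x0800)
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}
HTTP_PORT = 80


def classify(eth_proto, ip_proto, src_port=None, dest_port=None):
    """Decide which parser a captured packet should be handed to."""
    if eth_proto != ETH_PROTO_IPV4:
        return "skip"  # step 1: anything that isn't IPv4 is ignored
    name = IP_PROTOCOLS.get(ip_proto, "other")  # step 2: pick by IP protocol
    if name == "TCP" and HTTP_PORT in (src_port, dest_port):
        return "TCP/HTTP"  # TCP can carry extra HTTP data
    return name
```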

Running The Thing

Big disclaimer for this one. This is not meant to be run in places where you don’t have permission, and the consequences for doing so are steep. Real penetration-testing experts attempt to hide their digital footprint, which I make no attempt at here. That’s a tutorial for another day, but the point is that this program is not for real use.

As mentioned earlier, some of the packages in this tutorial require Linux to run. I used VirtualBox VM to run a virtual instance of Linux, and found a handy pre-set Kali Linux machine from Offensive Security. If you get an error about a USB 2.0 port, simply go to the machine’s settings, go into ports and disable USBs. Once in, download the code from a repository and run sudo python3 (filename).

I know, it was a lot. But you made it! I hope through this exercise in vanity that you too have accomplished what we set out to do: to understand what we work with daily a bit better, and feel a little less helpless. On that front, this tutorial has taught me a lot:

  • Don’t ever log in or do anything private over public networks. Someone is one tutorial away from being able to read and keep that data.
  • Look for promiscuous mode: The easiest way to tell if a network is being sniffed.
  • Always use HTTPS. This encrypts your traffic and confirms that you’re getting the data from the site you meant to, which makes MITM attacks far harder. For more context, go here.
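As a rough illustration of the promiscuous mode tip, here is a sketch that checks one of your own Linux interfaces. It assumes the kernel exposes the interface flags as a hex string under /sys/class/net/<interface>/flags and that bit 0x100 (IFF_PROMISC) marks promiscuous mode; the function names and the default interface name eth0 are hypothetical. It can only inspect a machine you are logged in to, not someone else’s sniffer on the same network.

```python
from pathlib import Path

IFF_PROMISC = 0x100  # promiscuous-mode bit in Linux interface flags


def flags_show_promiscuous(flags_text):
    """Return True if a hex flags string such as '0x1103' has the promiscuous bit set."""
    return bool(int(flags_text.strip(), 16) & IFF_PROMISC)


def interface_is_promiscuous(interface="eth0"):
    """Read the flags of a local interface from sysfs and check the promiscuous bit."""
    flags_text = Path(f"/sys/class/net/{interface}/flags").read_text()
    return flags_show_promiscuous(flags_text)
```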

Through understanding and practice comes mastery, and mastery over your data is nothing to scoff at in a world where our lives are more data than not. We eat, sleep and breathe the internet, wanting to consume more and more of something that most do not truly understand, leaving them floating through a big ocean filled with predators. I hope through this that you’ve gained the knowledge to keep yourself safe, and feel a little more powerful over your virtual world.

You can find my repo for this project here: https://github.com/HexSeal/PySniffer

This was helped in no small part by this tutorial: https://www.youtube.com/watch?v=WGJC5vT5YJo&list=PL6gx4Cwl9DGDdduy0IPDDHYnUx66Vc4ed&index=1

Sources not already in the article:

https://www.informit.com/articles/article.aspx?p=29578&seqNum=5#:~:text=The%20size%20of%20the%20Header,length%20of%20the%20IP%20header

Written by Max Finn

I'm a passionate backend engineer writing about my code projects so that I can make it a little easier on myself(and hopefully you) later.