Unit 2 Definition List

Heuristic - a problem-solving approach (algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible.

Lossless Compression - a data compression algorithm that allows the original data to be perfectly reconstructed from the compressed data.

Lossy Compression - (or irreversible compression) a data compression method that uses inexact approximations, discarding some data to represent the content. Most commonly seen in compressed image, audio, and video formats.

Image - a type of data used for graphics or pictures.

Metadata - data that describes other data. For example, a digital image may include metadata that describes the size of the image, number of colors, or resolution.

Pixel - short for "picture element," it is the fundamental unit of a digital image, typically a tiny square or dot which contains a single point of color of a larger image.

RGB - the RGB color model, in which varying intensities of (R)ed, (G)reen, and (B)lue light are added together to reproduce a broad array of colors.

Abstraction - pulling out specific differences to make one solution work for multiple problems.

Aggregation - a computation in which rows from a data set are grouped together and used to compute a single value of more significant meaning or measurement. Common aggregations include: Average, Count, Sum, Max, Median, etc.

Pivot Table - in most spreadsheet software, the name of the tool used to create summary tables.

Summary Table - a table that shows the results of aggregations performed on data from a larger data set, hence a "summary" of larger data. Spreadsheet software typically calls them "pivot tables".

Unit 2 Review Guide

Bytes: A byte is the standard fundamental unit (or "chunk size") underlying most computing systems today. You may have heard "megabyte", "kilobyte", "gigabyte", etc., which are all different numbers of bytes. Recall that a single character of ASCII text requires 8 bits; the technical term for 8 bits of data is a byte. Modern data files typically measure in the thousands, millions, billions, or trillions of bytes, so the size of information in a computer is measured in kilobytes, megabytes, gigabytes, and terabytes. One kilobyte (KB) is a collection of about 1,000 bytes; a typical short email takes up just 1 or 2 kilobytes. One megabyte (MB) is about 1 million bytes (or about 1,000 kilobytes); an MP3 audio file or an image from a digital camera typically takes up a few megabytes. One gigabyte (GB) is about 1 billion bytes, or 1 thousand megabytes; a DVD movie is roughly 4-8 GB, and a flash memory card used in a camera might store 16 GB. One terabyte (TB) is about 1,000 gigabytes, or roughly 1 trillion bytes; external hard drives typically range from 500 GB to 2 TB.

File Sizes and Compression: Data files can grow very quickly in size, and we sometimes want to compress data in order to save time and space. Larger files require more disk space on your hard drive and also take longer to share over the Internet. There is an upper limit to how fast bits can be transmitted over the Internet, so to work within that limit we try to compress data. The art and science of compression is about figuring out how to represent the SAME DATA with FEWER BITS. There are two categories of compression techniques: lossy compression and lossless compression. Lossless compression allows the original data to be perfectly reconstructed from the compressed data.
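To make the unit arithmetic above concrete, here is a minimal Python sketch; the example message text is made up, and it uses the rounded 1,000-based unit sizes from this guide rather than exact powers of two.

```python
# Rough byte arithmetic using the approximate 1,000-based units from this guide.

KILOBYTE = 1_000          # about 1 thousand bytes
MEGABYTE = 1_000_000      # about 1 million bytes
GIGABYTE = 1_000_000_000  # about 1 billion bytes

message = "Hello, this is a short email."   # made-up example text

# Each ASCII character takes 8 bits, and 8 bits = 1 byte.
size_in_bits = len(message) * 8
size_in_bytes = size_in_bits // 8

print(size_in_bytes, "bytes")
print(size_in_bytes / KILOBYTE, "KB")

# A few megabytes for an MP3, expressed in raw bytes:
mp3_size_bytes = 4 * MEGABYTE
print(mp3_size_bytes, "bytes in a roughly 4 MB MP3 file")
```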
Lossy compression gets rid of some of the data forever, so you lose some of the quality of an image, audio, or video file. This can be acceptable because humans can't always recognize small losses in quality. It's mostly used in visual or audio formats where a loss in precision is undetectable to human eyes and ears. Lossy compression schemes are ones in which "useless" or less-than-totally-necessary information is thrown out in order to reduce the size.

Lossless compression is often achieved by finding patterns within data. When discovering patterns in data you may discover more and more patterns based on your initial finding. If you start off with a different initial pattern, it will lead you to entirely new patterns, but which way is best for compression? The answer is that you'll never know unless you try every possible combination. There is hope, though, and that hope is called a heuristic. A heuristic is a problem-solving approach (typically an algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible. Sometimes good enough works. There is no real way to determine for sure that you've got the best compression besides trying everything possible by brute force. Heuristics are techniques (lists of steps) for at least making progress toward a "good enough" solution. Heuristics are not perfect solutions; there is no way to prove a technique is best without trying all possibilities by brute force. Brute-force search here is an example of an algorithm that cannot run in a "reasonable amount of time", and algorithms like these are not practical to pursue.

Images and Metadata: All computer images and videos are made up of pixels. Images on computer screens are created with light by illuminating pixels on the screen. Your computer screen probably has over a million of these tiny pixels, which display images based on the color of light each pixel emits. Screen resolution is the number of pixels and how they are arranged vertically and horizontally, and density is the number of pixels per a given area. To create images on your screen we have to tell the pixels what colors to emit, and we accomplish this with binary. But images need more than just the binary data describing what color each pixel should be. Images also need data that describes other properties of the image, like its width and height. In other words, images require metadata. Metadata is data that describes other data. For example, a digital image may include metadata that describes the size of the image, number of colors, or resolution. Metadata is also used in the Transmission Control Protocol (TCP), where larger messages are broken down into smaller packets and each packet is given metadata containing ordering information so the original message can be reconstructed.

Black and white images are not too difficult to encode, since you only have to represent two colors in binary: a 1 can represent a black pixel and a 0 can represent a white pixel. The metadata describes the dimensions of the image, and there you have all the data required to create a B&W image. One problem, though: the world is full of color. We have to be creative in coming up with a technique to represent color images in binary. The way color is represented in a computer is different from the way we represented text or numbers. With text, we just made a list of characters and assigned a number to each one.
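To make the black-and-white encoding above concrete, here is a minimal Python sketch; the bit string and the width "metadata" are made-up example values, and the printout just uses "#" for black and "." for white.

```python
# Decode a tiny black-and-white image: 1 = black pixel, 0 = white pixel.
# The bit string and the width metadata below are made-up example values.

bits = "1111100110011111"   # 16 bits = a 4x4 image
width = 4                   # metadata: how many pixels per row

# Break the flat bit string into rows and print each pixel.
for row_start in range(0, len(bits), width):
    row = bits[row_start:row_start + width]
    # Use "#" for a black pixel (1) and "." for a white pixel (0).
    print("".join("#" if bit == "1" else "." for bit in row))
```

Running this prints a small hollow square, showing how the same flat list of bits plus a little metadata (the width) is enough to rebuild the picture.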
With color, we actually use binary to encode the physical phenomenon of LIGHT. Each pixel on your screen is made up of three tiny lights: red, green, and blue. By changing the intensities of the red, green, and blue lights we can create a large variety of colors, depending on how many bits we reserve to represent the intensity of each color. Typically color pixels are represented using 24 bits: 8 for red, 8 for green, and 8 for blue. 00000000 (0 DEC) is the darkest intensity for a color, and 11111111 (255 DEC) is the brightest. Using different intensities of red, green, and blue can create a huge number of different colors. One color pixel can be represented in binary as 11111100 00101111 01010101 (BIN), or 252 47 85 (DEC), which means there is a lot of red, a little bit of green, and some blue illuminating inside the pixel. We can represent this binary information in a much simpler fashion: hexadecimal. Instead of writing 11111111 00001010 00000000 (BIN), you can write FF 0A 00 (HEX), which is much easier for a human to read and write. Every 4 bits in binary can be interpreted as one hex digit. Note this is not a form of compression, just a different representation. Image filters are created by adjusting RGB values by some function (adding or subtracting a number from the different color intensities); math is applied to RGB values to create different photo filters (a small worked example appears at the end of this section).

Digital images are just data (lots of data) composed of layers of abstraction: pixels on your screen made up of red, green, and blue lights whose intensities are based on an 8-bit binary number (darkest -> brightest). Almost anything can be represented as data if you break it down enough so that it can be represented as binary information. You can represent anything with text or numbers if broken down enough. First we learned to represent binary numbers, then ASCII text (using 8-bit binary numbers), then formatted text (using special patterns in ASCII), and finally color images. High-level encodings are actually quite removed from the underlying bits from which they are made. In the world of computer science, we call this abstraction: a mental tool that allows us to ignore low-level details when they are unnecessary. This ability to ignore small details is what allows us to develop complex encodings and protocols. Everything is represented as 1s and 0s, but you don't always have to acknowledge that fact to get certain tasks accomplished.

Data: You typically don't have to break digital data down all the way to bits in order to work with it, but understanding that digital data at its root is just bits gives you insight into working with larger data sets. People generate data all the time by visiting their favorite social media sites, performing Google searches, watching Netflix, swiping their credit cards, etc. Data is being generated by everyday actions people take, but who owns this data and where is it stored? Companies that own the applications and services you use every day own the data you generate. They store this data on servers which are physically located somewhere on Earth. Companies like Facebook and Instagram don't store all the data they gather in one location. Instead, they make many copies and store them in many different locations, so that if one of the data centers ends up corrupted they have backups in place and things will continue to run smoothly. This is the concept of redundancy.
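As referenced above, here is a minimal Python sketch of the RGB ideas in this section; the example pixel (252, 47, 85) comes from the text, while the +40 brightness amount is an assumption chosen just for illustration.

```python
# Convert an RGB pixel to hex, and apply a simple "brightness" filter.
# The +40 adjustment is a made-up example amount.

def rgb_to_hex(red, green, blue):
    # Each 8-bit intensity becomes two hex digits (4 bits per hex digit).
    return "{:02X} {:02X} {:02X}".format(red, green, blue)

def brighten(pixel, amount):
    # Add the same amount to each intensity, clamping to the 0-255 range.
    return tuple(min(255, max(0, value + amount)) for value in pixel)

pixel = (252, 47, 85)          # lots of red, a little green, some blue
print(rgb_to_hex(*pixel))      # FC 2F 55

brighter = brighten(pixel, 40)
print(brighter)                # (255, 87, 125) after clamping
print(rgb_to_hex(*brighter))   # FF 57 7D
```

A real photo filter would apply the same kind of arithmetic to every pixel in the image, not just one.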
Data can also be generated by simply asking people for it directly, through online surveys or other online tools. These tools try to capture people's responses because the data, in aggregate (all together), might contain useful information that could be extracted (a small aggregation sketch appears at the end of this section). There is lots of data gathered by individuals and organizations, which makes it possible to compute with that data and find interesting stories, patterns, or trends.

When analyzing a large collection of data, it is often useful to use a tool that helps us visualize it. Looking at a spreadsheet of data can be helpful, but it isn't great for noticing patterns or trends occurring within the data set. Tools like Google Trends allow us to visualize the popularity of Internet searches compared with other search topics. Visualization tools make it much easier to understand and interpret data than simply looking at the raw data itself, and visualizations allow for better communication of information and stories to audiences.

Be careful when looking at visualizations! Draw a distinction between describing what the data shows and describing why it might be that way. Be aware of the assumptions you are making when looking at visualizations, for they can be misleading. For example, using Google Trends with the search term "suicide", there was a spike in searches in August 2016. One assumption made was that school was starting up again and that was why the spike occurred. In fact, a popular movie called "Suicide Squad" came out in August 2016 and was most likely the reason for the spike. Be careful with your assumptions. Analyzing and interpreting data will typically require some assumptions to be made about the accuracy of the data and the cause of the relationships observed within it. When decisions are made based on a collection of data, they often rest just as much on that set of assumptions as on the data itself. Identifying and validating (or disproving) assumptions is therefore an important part of data analysis. Furthermore, clear communication about how data was interpreted should also include an account of the assumptions made along the way. If our assumptions are wrong, our entire analysis may be wrong as well.

Digital Divide: Access to and use of the Internet differs by income, race, education, age, disability, and geography. As a result, some groups are over- or under-represented when looking at activity online. When we see behavior on the Internet, like search trends, we may be tempted to assume that access to the Internet is universal and that we are therefore taking a representative sample of everyone. In reality, a "digital divide" leads to some groups being over- or under-represented, and some people may not be on the Internet at all. Be aware of this when you try to use Internet activity to draw conclusions about the real world.

Data Visualizations: An important skill is the ability to critically evaluate information. As our world is increasingly filled with data, more and more of the information from that data is conveyed through visualizations. Visualization is useful both for discovering connections and trends and for communication. Computing has enabled massive amounts of information to be automatically collected, aggregated, analyzed, and visualized. Visualizations are useful in helping humans understand large amounts of data quickly, and they are useful communication tools when presenting findings about a collection of data.
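As referenced above, here is a minimal aggregation sketch in Python, assuming a made-up survey data set; it groups rows by a category and computes an average per group, which is the same idea a spreadsheet pivot table automates.

```python
# A tiny aggregation (summary table) by hand: group rows by a category,
# then compute an average per group. The survey rows below are made up.

survey_rows = [
    {"grade": "9th",  "hours_online": 2},
    {"grade": "9th",  "hours_online": 4},
    {"grade": "10th", "hours_online": 3},
    {"grade": "10th", "hours_online": 5},
    {"grade": "10th", "hours_online": 1},
]

# Group the hours-online values by grade level.
groups = {}
for row in survey_rows:
    groups.setdefault(row["grade"], []).append(row["hours_online"])

# Summary table: one row per grade with its average hours online.
for grade, hours in groups.items():
    average = sum(hours) / len(hours)
    print(grade, "average hours online:", round(average, 1))
```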
Not all visualizations are created equal, however, and in many cases the type of visualization used may distract or even mislead the reader. Some advantages of using visualizations: pictures allow you to compare things more easily, they make it easier to see trends or patterns, and they can focus on or highlight particular aspects of the data that are important. Some disadvantages: visualizations make it easy to mislead or miscommunicate, they can remove details that might be important or valuable, and they are sometimes very dense and may take a while to study before you understand what they mean. Good visualizations are simple and easy to read, while bad visualizations are complicated, with confusing images or colors, too much text, etc. Choosing the right way to visualize data is essential to communicating your ideas. There are stories in data; visualization helps you tell them. Before understanding visualizations, you must understand the types of data that can be visualized and their relationships to each other. Certain chart types are right for certain situations, depending on the data. "Data Visualization 101" (linked below) is an excellent guide on how to create nice visualizations. Tools like Google Spreadsheets or MS Excel can help turn raw data into different types of charts and graphs (scatter, line, bar). Any computer scientist working with data should have some skill and facility with producing visualizations of the data to get a sense of what it contains. Taking data from its raw state to the point where you can create a meaningful visualization involves several steps, and practice with different visualization tools is required. The goal with creating visualizations isn't to create the prettiest chart but the chart that makes the most sense for the data you've got. The data has stories within it that can be found, and finding "no correlation" or "no relationship" is actually just as interesting as finding a strong correlation or relationship.

Sometimes, before analyzing and performing computations on raw data to create summary tables or visualizations, the data must first be "cleaned" (a small cleaning sketch appears after the Web Links below). Using computational tools to analyze data has made it much easier to find trends and patterns in large datasets. When preparing data for this kind of analysis, however, it's important to remember that the computer is much less "intelligent" than we might imagine. Small discrepancies in the data may prevent accurate interpretation of trends and patterns, or can even make it impossible to use the data in computation in the first place. Cleaning data is therefore an important step in analyzing it, and in many contexts it may actually take the largest amount of time. When we collect data, it's usually "dirty," which means that, for one reason or another, it's not ready for analysis. To get what we want, we may have to manipulate the data ourselves so that the computer can work on it.

Web Links:
Text Compression Widget: https://www.youtube.com/watch?v=LCGkcn1f-ms&feature=youtu.be
Images, Pixels and RGB: https://www.youtube.com/watch?v=15aqFQQVBWU&feature=youtu.be
Data Visualization 101: How to design charts and graphs: http://content.visage.co/hs-fs/hub/424038/file-2094950163-pdf
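As mentioned in the cleaning paragraph above, here is a minimal Python sketch, assuming a made-up list of "dirty" survey rows; it shows a few common cleaning steps (trimming whitespace, normalizing capitalization, and skipping rows with missing or non-numeric values) before doing a simple aggregation.

```python
# A small data-cleaning pass over made-up "dirty" survey rows:
# trim whitespace, normalize capitalization, and skip unusable rows.

raw_rows = [
    {"state": " ca ", "age": "17"},
    {"state": "CA",   "age": "16"},
    {"state": "ny",   "age": ""},      # missing age: skip
    {"state": "  NY", "age": "18"},
    {"state": "ca",   "age": "old"},   # not a number: skip
]

clean_rows = []
for row in raw_rows:
    state = row["state"].strip().upper()   # " ca " and "CA" both become "CA"
    age_text = row["age"].strip()
    if not age_text.isdigit():             # drop missing or non-numeric ages
        continue
    clean_rows.append({"state": state, "age": int(age_text)})

# Now the cleaned data can be aggregated, e.g. counting rows per state.
counts = {}
for row in clean_rows:
    counts[row["state"]] = counts.get(row["state"], 0) + 1
print(counts)   # {'CA': 2, 'NY': 1}
```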