AP CSP Unit 2 Review Guide_Definition List

Unit 2 Definition List

Heuristic - a problem solving approach (algorithm) to find a satisfactory solution where
finding an optimal or exact solution is impractical or impossible.

Lossless Compression - a data compression algorithm that allows the original data to be
perfectly reconstructed from the compressed data.

Lossy Compression - (or irreversible compression) a data compression method that uses
inexact approximations, discarding some data to represent the content. Most commonly
seen in image formats like JPEG.

Image - A type of data used for graphics or pictures.

Metadata - data that describes other data. For example, a digital image may include
metadata that describes the size of the image, number of colors, or resolution.

Pixel - short for "picture element," it is the fundamental unit of a digital image, typically
a tiny square or dot that contains a single point of color within a larger image.

RGB - a color model in which varying intensities of (R)ed, (G)reen, and (B)lue light are
added together to reproduce a broad array of colors.

Abstraction - Pulling out specific differences to make one solution work for multiple
problems.

Aggregation - a computation in which rows from a data set are grouped together and
used to compute a single value of more significant meaning or measurement. Common
aggregations include: Average, Count, Sum, Max, Median, etc.

Pivot Table - in most spreadsheet software it is the name of the tool used to create
summary tables.

Summary Table - a table that shows the results of aggregations performed on data from
a larger data set, hence a "summary" of larger data. Spreadsheet software typically calls
them "pivot tables".
Unit 2 Review Guide
Bytes:
A byte is the standard fundamental unit (or “chunk size”) underlying most computing systems
today. You may have heard “megabyte”, “kilobyte”, “gigabyte”, etc., which are all different
amounts of bytes. We’re going to learn more about them today.
Recall that a single character of ASCII text requires 8 bits. The technical term for 8 bits of data
is a Byte.
Modern data files typically measure in the thousands, millions, billions or trillions of bytes. The
size of information in the computer is measured in kilobytes, megabytes, gigabytes, and
terabytes.
One kilobyte (KB) is a collection of about 1000 bytes. A typical short email would take up
just 1 or 2 kilobytes.
One megabyte (MB) is about 1 million bytes (or about 1000 kilobytes). An MP3 audio file would
typically take up a few megabytes, and so would an image from a digital camera.
One gigabyte (GB) is about 1 billion bytes, or 1000 megabytes. A DVD movie is roughly 4-8 GB. A flash memory card used in a camera might store 16 GB.
One terabyte (TB) is about 1000 gigabytes, or roughly 1 trillion bytes. External hard drives
typically range from 500 GB to 2 TB.
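
As a quick sanity check, here is a small Python sketch (the starting file size is a made-up example) that converts a byte count into the approximate units above. It just uses the rough "about 1000" relationships from this guide:

size_in_bytes = 3_500_000   # made-up example, roughly a photo from a digital camera

print(size_in_bytes / 1_000, "KB")           # 3500.0 KB
print(size_in_bytes / 1_000_000, "MB")       # 3.5 MB
print(size_in_bytes / 1_000_000_000, "GB")   # 0.0035 GB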
File Sizes and Compression:
Data files can grow very quickly in size. We sometimes want to compress data in order to
save time and space. Larger files require more disk space on your hard drive and also take longer
to share over the internet. There is an upper limit to how fast bits can be transmitted over the
Internet, so to combat that we try to compress data.
The art and science of compression is about figuring out how to represent the SAME DATA with
FEWER BITS. There are two categories of compression techniques: Lossy Compression and
Lossless Compression.
Lossless Compression allows the original data to be perfectly reconstructed from the
compressed data.
Lossy Compression gets rid of some of the data forever and you lose some of the quality of any
image, audio, or video file. This can be useful because humans can’t always recognize minimal
losses in quality. It’s mostly used in visual or audio formats where a loss in precision is
undetectable to human eyes and ears. Lossy compression schemes are ones in which “useless”
or less-than-totally-necessary information is thrown out in order to reduce the size.
Lossless compression is sometimes achieved by finding patterns within data. When discovering
patterns in data you may discover more and more patterns based on your initial finding. If you
start off with a different starting pattern in the data, it will lead you to entirely new patterns,
but which way is best for compression? The answer is you’ll never know for sure unless you try
every possible combination. There is hope though, and that hope is called a heuristic.
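
Here is a rough Python sketch of that pattern-substitution idea. It is a hypothetical illustration of the general technique, not the exact algorithm from the text compression widget, and it assumes the dictionary symbols (*, ^, ~, #) do not already appear in the text:

def compress(text, symbols="*^~#"):
    dictionary = {}
    for symbol in symbols:
        best_pattern, best_savings = None, 0
        # Greedy rule: try substrings of length 2 to 8 and keep whichever one
        # saves the most characters right now.
        for length in range(2, 9):
            for start in range(len(text) - length + 1):
                pattern = text[start:start + length]
                count = text.count(pattern)
                # Each occurrence shrinks to 1 character; the dictionary entry
                # costs roughly the pattern plus its symbol.
                savings = count * (length - 1) - (length + 1)
                if count > 1 and savings > best_savings:
                    best_pattern, best_savings = pattern, savings
        if best_pattern is None:
            break
        dictionary[symbol] = best_pattern
        text = text.replace(best_pattern, symbol)
    return text, dictionary

def decompress(text, dictionary):
    # Expand symbols in reverse order so later patterns (which may contain
    # earlier symbols) are restored first. Reconstruction is perfect: lossless.
    for symbol, pattern in reversed(list(dictionary.items())):
        text = text.replace(symbol, pattern)
    return text

original = "banana band banana band banana"
compressed, dictionary = compress(original)
print(compressed, dictionary)
print(decompress(compressed, dictionary) == original)   # True

The "keep whichever pattern saves the most characters right now" rule is exactly that kind of heuristic: quick and usually good, but not guaranteed to find the best possible compression.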
A heuristic is a problem solving approach (typically an algorithm) to find a satisfactory solution
where finding an optimal or exact solution is impractical or impossible. Sometimes good
enough works.
There is no real way to determine for sure that you’ve got the best compression besides trying
everything possible by brute force. Heuristics are techniques (list of steps) for at least making
progress toward a “good enough” solution.
Heuristics are not perfect solutions. There is no way to prove a technique is best without trying
all possibilities by brute force. This is an example of an algorithm that cannot run in a
“reasonable amount of time”. Algorithms like these are not practical options.
Images and Metadata:
All computer images and videos are made up of pixels. Images on computer screens are
created with light by illuminating pixels on the screen. Your computer screen probably has over
a million of these tiny pixels, which display images based on the color of light each pixel
emits. Screen resolution is the number of pixels and how they are arranged
vertically and horizontally, and density is the number of pixels per a given area.
To create images on your screen we have to tell the pixels what colors to emit and we
accomplish this through binary. But images need more than just the binary data describing what
color the pixels should be. Images also need data that describes other properties, like the
width and height of the image. Images require metadata.
Metadata is data that describes other data. For example, a digital image may include metadata
that describes the size of the image, number of colors, or resolution. Metadata is also
used in the Transmission Control Protocol (TCP), where larger messages are broken down
into smaller packets and given metadata that contains the ordering information of packets for
re-creation of the original message.
Black and white images are not too difficult to encode since you only have to represent two
colors in binary. A 1 can represent a black pixel and a 0 can represent a white pixel. The
metadata would describe the dimensions of the image and there you have all the data required
to create a B&W image. One problem, though: the world is full of color. We have to be creative
in coming up with a technique to represent color images in binary.
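
A tiny Python sketch of that B&W idea (the "image" below is made up): the metadata is the width and height, and the pixel data is one bit per pixel.

width, height = 5, 3              # the metadata
bits = "10101" "01010" "10101"    # the pixel data, row after row: 1 = black, 0 = white

for row in range(height):
    line = bits[row * width:(row + 1) * width]
    print(line.replace("1", "#").replace("0", "."))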
The way color is represented in a computer is different from the ways we represented text or
numbers. With text, we just made a list of characters and assigned a number to each one. With
color, we actually use binary to encode the physical phenomenon of LIGHT. Each pixel on your
screen is made up of three tiny lights which are red, green, and blue. By changing the intensities
of the red, green, and blue lights we can create a large variety of colors depending on how many
bits we reserve to represent the intensity of each color.
Typically color pixels are represented using 24 bits, 8 for Red, 8 for Green, and 8 for Blue.


00000000 (0 DEC) is the darkest intensity for a color
11111111 (255 DEC) is the brightest intensity for a particular color
Using different intensities for Red, Green, and Blue can create a ton of different colors. One
color pixel can be represented in binary like so: 11111100 00101111 01010101 (BIN) or
252 47 85 (DEC). This means there is a lot of red, a little bit of green, and some blue
illuminating inside the pixel.
We can represent this binary information in a much simpler fashion: hexadecimal. Instead of
writing 11111111 00001010 00000000 (BIN), you can write FF 0A 00 (HEX), which is much easier
for a human to read and write. Every 4 bits in binary can be interpreted as one hex digit. Note
this is not a form of compression, just a different representation.
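
As a small Python illustration using the hex color from the paragraph above, each pair of hex digits converts directly into one 8-bit decimal intensity:

hex_color = "FF0A00"              # FF 0A 00 (HEX) is the same color as 255 10 0 (DEC)
red   = int(hex_color[0:2], 16)   # FF -> 255
green = int(hex_color[2:4], 16)   # 0A -> 10
blue  = int(hex_color[4:6], 16)   # 00 -> 0
print(red, green, blue)           # 255 10 0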
Image filters are created by adjusting RGB values by some function (adding or subtracting a
number to the different color intensities). Math is applied to RGB values to create different
photo filters.
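
For example, a hypothetical "brighten" filter in Python might add a fixed amount to each RGB intensity, keeping every value inside the 0-255 range an 8-bit number allows:

def brighten(pixel, amount=40):
    # Add the same amount to the R, G, and B intensities, capped at 255.
    return tuple(min(255, channel + amount) for channel in pixel)

print(brighten((252, 47, 85)))   # (255, 87, 125)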
Digital images are just data (lots of data) composed of layers of abstraction: pixels on your
screen made up of red, green, and blue lights whose intensities are based on an 8-bit binary
number (darkest -> brightest).
Almost anything can be represented in data if you break it down enough so that it can be
represented as binary information. You can represent anything with text or numbers if broken
down enough. First we learned to develop binary numbers, then ASCII text (using 8-bit binary
numbers), then formatted text (using special patterns in ASCII), and finally color images.
High-level encodings are actually quite removed from the underlying bits from which they are made.
In the world of computer science, we call this abstraction - a mental tool that allows us to
ignore low-level details when they are unnecessary. This ability to ignore small details is what
allows us to develop complex encodings and protocols. Everything is represented as 1s and 0s
but you don’t always have to acknowledge that fact to get certain tasks
accomplished.
Data:
You typically don’t have to break digital data down all the way to bits in order to work with it, but
understanding that digital data at its root is just bits gives you insights into working with larger
data sets.
People generate data all the time by visiting their favorite social media sites, performing Google
searches, watching Netflix, swiping their credit cards, etc. Data is being generated by everyday
actions people make but who owns this data and where is it stored?
Companies that own the applications and services you use every day own the data you
generate. They store this data on servers which are physically located somewhere on Earth.
Companies like Facebook and Instagram don’t store all the data they gather in one location.
Instead, they make many copies and store them in many different locations, so that if one of the
data centers ends up corrupted they have backups in place so that things will continue to run
smoothly. This is the concept of redundancy.
Data can also be generated by simply asking people for the data directly either through online
surveys or using other online tools. These tools try to capture people’s responses to things
because the data, in aggregate (all together), might contain useful information that could be
extracted. There is lots of data gathered by individuals and organizations, which makes it
possible to compute with/on and find interesting stories, patterns, or trends.
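
A small Python sketch of computing on aggregated data (the survey numbers are made up): rows are grouped, and each group is reduced to a single, more meaningful value, which is exactly what a summary (pivot) table does.

# Made-up survey rows: (favorite genre, rating out of 5).
rows = [("comedy", 4), ("comedy", 5), ("drama", 3), ("drama", 5), ("drama", 4)]

groups = {}
for genre, rating in rows:
    groups.setdefault(genre, []).append(rating)

# Aggregations for each group: a count and an average.
summary = {genre: (len(ratings), sum(ratings) / len(ratings)) for genre, ratings in groups.items()}
print(summary)   # {'comedy': (2, 4.5), 'drama': (3, 4.0)}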
When analyzing a large aggregation of data it is sometimes useful to use a tool that helps us
visualize this data. Looking at a Spreadsheet of data can be helpful but it isn’t great for noticing
patterns or trends occurring within the data set. Tools like Google Trends allow us to visualize
the popularity of internet searches vs. other search topics. Visualization tools make it much
easier to understand and interpret data than simply looking at the raw data itself.
Visualizations allow for better communication of information/story to audiences.
Be careful when looking at visualizations! Draw a distinction between describing what the data
shows and describing why it might be that way. Be aware of the assumptions you are making
when looking at visualizations, for they can be misleading. For example, using Google Trends and
the search term Suicide, there was a spike in searches relating to Suicide in August 2016.
One assumption made was that school was starting up again and that was why there
was a spike in the search term. Instead, a popular movie called “Suicide Squad” came out in
August 2016 and was most likely the reason there was such a spike in searches for the word
Suicide. Be careful with your assumptions.
Analyzing and interpreting data will typically require some assumptions to be made about the
accuracy of the data and the cause of the relationships observed within it. When decisions are
made based on a collection of data, they will often rest just as much on that set of assumptions
about the data as the data itself. Identifying and validating (or disproving) assumptions is
therefore an important part of data analysis. Furthermore, clear communication about how
data was interpreted should also include an account of the assumptions made along the way. If
our assumptions are wrong, our entire analysis may be wrong as well.
Digital Divide
Access and use of the Internet differs by income, race, education, age, disability, and
geography. As a result, some groups are over- or under-represented when looking at activity
online. When we see behavior on the Internet, like search trends, we may be tempted to
assume that access to the Internet is universal and so we are taking a representative sample of
everyone. In reality, a “digital divide” leads to some groups being over- or under-represented.
Some people may not be on the Internet at all. Be aware of this when you are trying to
interpret reality based on what you see on the internet.
Data Visualizations
An important skill is the ability to critically evaluate information. As our world is increasingly
filled with data, more and more the information from that data is conveyed through
visualizations. Visualization is useful for both discovery of connections and trends and
also communication. Computing has enabled massive amounts of information to be
automatically collected, aggregated, analyzed, and visualized. Visualizations are useful in
helping humans understand large amounts of data quickly, and they are useful communication
tools when presenting findings about a collection of data.
Not all visualizations are created equal, however, and in many cases the type of
visualization used may distract or even mislead the reader. Some advantages of using
visualizations: pictures allow you to compare things more easily, they make it easier to see trends
or patterns, and they can focus on, or highlight, particular aspects of the data that are important.
Some disadvantages: visualizations make it easy to mislead or miscommunicate, they can remove
details that might be important or valuable, and they are sometimes very dense and may take a
while to study before you understand what they mean.
Good visualizations are simple and easy to read while bad visualizations are complicated, can
have confusing images or colors, too much text, etc.
Choosing the right way to visualize data is essential to communicating your ideas. There are
stories in data; visualization helps you tell them. Before understanding visualizations, you must
understand the types of data that can be visualized and their relationships to each other.
Certain chart types are right for certain situations, depending on the data. “Data Visualization
101” is an excellent guide on how to create nice visualizations.
Using tools like Google Spreadsheets or MS Excel can help you turn raw data into different
types of charts and graphs (scatter, line, bar). Any computer scientist working with data should
have some skills and facility with producing visualizations of the data to get a sense of what it
contains. Taking data from its raw state to the point where you can create a meaningful
visualization involves several steps. Practice with different visualization tools is required.
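
If you prefer code over a spreadsheet, here is a hedged Python sketch using the matplotlib charting library instead of Google Spreadsheets or Excel; the data points are made up, but the idea of turning raw data into a chart is the same.

import matplotlib.pyplot as plt

# Made-up raw data for illustration only.
hours_studied = [1, 2, 3, 4, 5, 6]
test_scores = [55, 62, 70, 74, 83, 90]

plt.scatter(hours_studied, test_scores)   # scatter chart of the raw data
plt.xlabel("Hours studied")
plt.ylabel("Test score")
plt.title("Raw data as a visualization")
plt.show()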
The goal of creating visualizations isn’t to create the prettiest chart but the chart that
makes the most sense for the data you’ve got. The data has stories within that can be found.
Finding “no correlation” or “no relationship” is actually just as interesting as finding a strong
correlation or relationship.
Sometimes before analyzing and performing computations on raw data to create Summary
Tables or Visualizations, the data must first be “cleaned”. Using computational tools to analyze
data has made it much easier to find trends and patterns in large datasets. When preparing
data for this kind of analysis, however, it’s important to remember that the computer is much
less “intelligent” than we might imagine. Small discrepancies in the data may prevent accurate
interpretation of trends and patterns or can even make it impossible to use the data in
computation in the first place. Cleaning data is therefore an important step in analyzing it, and
in many contexts, it may actually take the largest amount of time. When we collect data, it’s
usually “dirty,” which means that, for one reason or another, it’s not ready for analysis. To get
what we want we may have to manipulate the data ourselves so that we can get the computer
to work on it.
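
A minimal Python sketch of cleaning (the survey responses are made up): trimming extra whitespace and normalizing capitalization so the computer treats "  Pizza", "pizza", and "PIZZA " as the same answer before counting them.

raw_responses = ["  Pizza", "pizza", "PIZZA ", "Tacos", "tacos"]

# Clean each response: strip extra spaces and make the case consistent.
cleaned = [response.strip().lower() for response in raw_responses]

# Now aggregation works the way we expect.
counts = {}
for response in cleaned:
    counts[response] = counts.get(response, 0) + 1

print(counts)   # {'pizza': 3, 'tacos': 2}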
Web Links:
Text Compression Widget:
https://www.youtube.com/watch?v=LCGkcn1f-ms&feature=youtu.be
Images, Pixels and RGB:
https://www.youtube.com/watch?v=15aqFQQVBWU&feature=youtu.be
Data Visualization 101: How to design charts and graphs
http://content.visage.co/hs-fs/hub/424038/file-2094950163-pdf