Data: AP Computer Science Principles (CSP) Study Guide
For: Candidates sitting the AP Computer Science Principles exam.
Covers: All required AP CED Data subtopics: binary numbers and abstraction, lossy vs lossless data compression, data analysis and visualization, and information privacy and security.
You should already know: No prior CS required.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP CS Principles style for educational use. They are not reproductions of past College Board papers and may differ in wording, numerical values, or context. Use them to practice the technique; cross-check with official College Board scoring guidelines for grading conventions.
1. What Is Data?
Data is any sequence of symbols, measurements, or observations that can be interpreted to convey meaning. In digital systems, all data is stored as sequences of binary digits at the lowest hardware level. This unit is a core pillar of the AP CS Principles CED, as data underpins every other area of the course, from programming to the social impacts of computing. Common synonyms for data include digital information, raw datasets, and structured input. Unlike analog data (such as a physical vinyl record or a handwritten letter), digital data can be easily copied, modified, transmitted, and analyzed using computing systems.
2. Binary numbers and abstraction
Binary is a base-2 number system that uses only two digits, 0 and 1. Each binary digit is called a bit. Digital hardware uses binary because electronic circuits can reliably distinguish two states: on (representing 1) and off (representing 0). Every type of digital data (numbers, text, images, audio, video, and program code) is encoded as a sequence of bits.
Abstraction is the process of hiding low-level implementation details to focus on higher-level functionality, and it is the reason you never have to interact with raw binary directly to use a computer. For example, when you open a JPEG image, your operating system abstracts the millions of underlying 0s and 1s into a viewable photo, hiding the binary encoding entirely.
Key binary rules and conversions
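To convert a binary number to decimal (base-10, the standard human number system), sum the product of each bit and its corresponding power of 2, starting from the rightmost bit, which has a position value of $2^0 = 1$:
$$\text{decimal} = \sum_{i=0}^{n-1} b_i \cdot 2^i$$
where $n$ is the number of bits and $b_i$ is the bit at position $i$, counting from the right.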
Worked example 1: Binary to decimal
Convert the 8-bit binary number to decimal:
- List the position values from right to left: $2^0 = 1$, $2^1 = 2$, $2^2 = 4$, $2^3 = 8$, $2^4 = 16$, $2^5 = 32$, $2^6 = 64$, $2^7 = 128$
- Multiply each bit by its position value:
- Sum the results:
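The positional method above can be sketched in a few lines of Python. The function name `binary_to_decimal` and the sample input are illustrative assumptions, not the value used in the worked example.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum each bit times its power of 2, with the rightmost bit worth 2**0."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {bit!r}")
        total += int(bit) * 2 ** position
    return total


# Hypothetical sample input, chosen only for illustration.
sample = "01001001"
print(binary_to_decimal(sample))  # 64 + 8 + 1 = 73
```

Python's built-in `int(sample, 2)` performs the same conversion; the loop is written out here only to mirror the manual steps.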
To convert a decimal number to 8-bit binary, subtract the largest possible power of 2 from the number repeatedly, marking a 1 for each power that fits and 0 for those that do not.
Worked example 2: Decimal to binary
Convert $73$ to 8-bit binary:
- Largest power of 2 that does not exceed 73 is 64 ($2^6$): mark 1, remainder $73 - 64 = 9$
- Next power 32 does not fit in 9: mark 0
- Next power 16 does not fit: mark 0
- Next power 8 fits: mark 1, remainder $9 - 8 = 1$
- Powers 4 and 2 do not fit: mark 0 for both
- Power 1 fits: mark 1
- Fill leading 0 to make 8 bits: $01001001_2$
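The subtract-the-largest-power procedure can be sketched as follows. The function name `decimal_to_binary` and its `width` parameter are assumptions made for this illustration.

```python
def decimal_to_binary(value: int, width: int = 8) -> str:
    """Walk the powers of 2 from largest to smallest, marking 1 when one fits."""
    if not 0 <= value < 2 ** width:
        raise ValueError(f"{value} does not fit in {width} bits")
    bits = []
    remainder = value
    for exponent in range(width - 1, -1, -1):
        power = 2 ** exponent
        if power <= remainder:
            bits.append("1")       # this power fits: subtract it
            remainder -= power
        else:
            bits.append("0")       # this power does not fit
    return "".join(bits)


print(decimal_to_binary(73))  # 01001001
```

Starting the loop at the highest position value (128 for 8 bits) produces the leading zero automatically, so no separate padding step is needed.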
Exam tip: Examiners frequently test that you understand all data types are stored as binary, not just numbers. For example, the ASCII encoding standard maps each text character to a number that is typically stored in one 8-bit byte: the letter 'A' is encoded as $01000001_2$, equal to $65_{10}$.
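As a quick check of the text-as-binary idea, the sketch below uses Python's built-in `ord` and `format` to show the number and the 8-bit pattern behind a character. The helper name `char_to_bits` is an assumption for this example.

```python
def char_to_bits(char: str) -> str:
    """Return a single character's numeric code as an 8-bit binary string."""
    code = ord(char)
    if code > 255:
        raise ValueError("character code does not fit in 8 bits")
    return format(code, "08b")


print(ord("A"))           # 65
print(char_to_bits("A"))  # 01000001
```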
3. Data compression — lossy vs lossless
Data compression is the process of reducing the size of a file by encoding its data more efficiently. Smaller files take up less storage space and transfer faster over the internet, making compression critical for streaming services, cloud storage, and web browsing. There are two core compression types tested on the AP CSP exam:
Lossless compression
Lossless compression reduces file size without removing any data, so the original file can be perfectly reconstructed when decompressed. It is used for files where even small amounts of data loss would make the file unusable, such as text documents, medical images, editable design files, and software executables. Common lossless formats include ZIP (general files), PNG (images), and FLAC (audio).
A simple lossless algorithm is run-length encoding (RLE), which replaces repeated sequences of data with a count and the value. For example, a row of black-and-white pixel data BBBBBWWWWBBBB (13 characters) is encoded as 5B4W4B (6 characters), reducing size by 54% with no data loss.
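A minimal sketch of run-length encoding, using the pixel row from the example above. It assumes the data itself contains no digit characters, since digits are used for the counts; the function names are illustrative.

```python
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with its count and the character."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Rebuild the original text from count/character pairs."""
    decoded = []
    count = ""
    for char in encoded:
        if char.isdigit():
            count += char          # counts may span more than one digit
        else:
            decoded.append(char * int(count))
            count = ""
    return "".join(decoded)


row = "BBBBBWWWWBBBB"
packed = rle_encode(row)
print(packed)                     # 5B4W4B
print(rle_decode(packed) == row)  # True
```

Decoding returns exactly the original row, which is what makes the scheme lossless. Note that RLE only helps when the data has long runs: a row like BWBWBW would grow to 1B1W1B1W1B1W.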
Lossy compression
Lossy compression permanently removes non-critical data to achieve much smaller file sizes than lossless compression can deliver. The removed data is usually imperceptible to humans, so quality loss is unnoticeable for most consumer use cases, but the original file can never be restored from the compressed version. It is used for streaming video (Netflix, YouTube), MP3 audio, and social media JPEG images, where smaller file size is more important than perfect fidelity.
For example, a 12MB raw camera photo can be compressed to a 900KB JPEG with almost no visible quality difference for mobile or social media use.
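To see why lossy compression cannot be undone, consider a toy sketch that rounds values down to a coarser scale. This is not how JPEG or MP3 work internally; it only illustrates the idea of discarding fine detail. The function name, the brightness values, and the step size are all hypothetical.

```python
def quantize(samples: list[int], step: int) -> list[int]:
    """Round each sample down to a multiple of `step`, discarding fine detail."""
    return [(sample // step) * step for sample in samples]


# Hypothetical 8-bit brightness values for one row of pixels.
original = [200, 201, 203, 198, 52, 50, 49, 51]
compressed = quantize(original, step=16)

print(compressed)              # [192, 192, 192, 192, 48, 48, 48, 48]
print(compressed == original)  # False: the discarded detail cannot be recovered
```

After quantizing, several different original values map to the same number, so there is no way to tell which one was there before. The simplified data also contains longer runs of identical values, which is why it can then be stored in less space.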
Exam tip: A common exam question asks you to select the correct compression type for a scenario. Remember: choose lossless if you need the exact original file for editing or legal purposes, and lossy if small file size is the highest priority.
4. Data analysis and visualization
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful insights, draw conclusions, and support decision-making. The standard AP CSP data analysis workflow is:
- Collect data: Gather data from surveys, sensors, public government datasets, or program outputs
- Clean data: Remove invalid, duplicate, or incomplete entries (e.g., a survey response claiming 25 hours of daily screen time is an invalid entry to discard)
- Process data: Calculate metrics such as mean, median, mode, and correlation between variables
- Communicate insights: Use data visualization to present findings clearly
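The clean and process steps of this workflow can be sketched with Python's standard `statistics` module. The survey values below are hypothetical, and the rule used for cleaning (drop missing entries and anything outside 0 to 24 hours) is an assumption based on the screen-time example above.

```python
from statistics import mean, median, mode

# Hypothetical survey responses: daily screen time in hours.
responses = [3.5, 4.0, 25.0, 4.0, None, 6.5, 2.0, 4.0]


def clean(values):
    """Keep only entries that are present and fit within a 24-hour day."""
    return [v for v in values if v is not None and 0 <= v <= 24]


valid = clean(responses)
print(valid)          # [3.5, 4.0, 4.0, 6.5, 2.0, 4.0]
print(mean(valid))    # 4.0
print(median(valid))  # 4.0
print(mode(valid))    # 4.0
```

Repeated values such as 4.0 are kept here because different people can give the same answer; removing duplicates only makes sense when the same response was recorded more than once.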
Data visualization is the graphical representation of data to make patterns, trends, and outliers far easier to identify than in raw spreadsheets. The four most common visualization types tested on the exam are:
- Bar charts: Compare values across discrete categories (e.g., average test scores by grade level)
- Line charts: Show changes in a variable over time (e.g., monthly rainfall over a year)
- Scatter plots: Show the relationship between two continuous variables (e.g., hours of study vs test score)
- Pie charts: Show proportional shares of a whole (e.g., percentage of budget allocated to each department)
Worked example: Data visualization
A student collects data on daily temperature and ice cream sales for a local shop over 30 days. A scatter plot with temperature on the x-axis and sales on the y-axis shows a strong positive correlation: as temperature increases, ice cream sales tend to rise. This pattern would be nearly impossible to spot by looking at the raw table of 30 temperature and sales entries.
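The strength of a relationship like this one can also be summarized with a single number, the Pearson correlation coefficient, where values near +1 indicate a strong positive linear relationship. The sketch below computes it by hand; the six readings are hypothetical and are not the student's 30-day dataset.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical readings: temperature in degrees Celsius and cones sold that day.
temperature_c = [18, 21, 24, 27, 30, 33]
cones_sold = [40, 52, 61, 75, 88, 95]

r = pearson_r(temperature_c, cones_sold)
print(round(r, 3))  # a value close to 1 indicates a strong positive correlation
```

A coefficient near +1 matches what the scatter plot shows visually, but it says nothing about why the two variables move together.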
Exam tip: A common trap is assuming correlation equals causation. The positive correlation between temperature and ice cream sales means the variables are related, but it does not prove that higher temperature causes higher sales (though in this case, the causal link is logical). For less obvious correlations, such as a link between screen time and lower test scores, confounding variables like sleep quality could explain the relationship.
5. Information privacy and security
Information privacy is the right of individuals to control how their personal data is collected, used, stored, and shared by computing systems. The most important term in this unit is PII (Personally Identifiable Information), which refers to any data that can be used to identify a specific individual, including full name, address, phone number, email, location history, biometric data (fingerprints, face scans), and Social Security number. PII is protected by regulations such as the EU's GDPR and California's CCPA, which require companies to disclose how they use user data and allow users to request deletion of their data.
Core security measures to protect data
- Encryption: Scrambles data so that, even if it is intercepted during transmission, it can only be read by someone who holds the secret decryption key
- Authentication: Verifies user identity with passwords, two-factor authentication (2FA), or biometric scans to prevent unauthorized access to accounts
- Anonymization: Removes all PII from datasets so individuals cannot be identified, used for research datasets that are shared publicly
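To make the idea of a key concrete, here is a toy sketch of encryption using a Caesar shift, where the key is the number of places each letter moves through the alphabet. This cipher is trivially breakable and is shown only as an illustration; real systems rely on vetted modern algorithms. The function name, key, and message are hypothetical.

```python
def caesar_shift(text: str, key: int) -> str:
    """Shift each English letter `key` places; other characters pass through."""
    shifted = []
    for char in text:
        if char.isascii() and char.isalpha():
            base = ord("A") if char.isupper() else ord("a")
            shifted.append(chr((ord(char) - base + key) % 26 + base))
        else:
            shifted.append(char)
    return "".join(shifted)


key = 3  # hypothetical shared secret
ciphertext = caesar_shift("Meet at noon", key)
print(ciphertext)                      # Phhw dw qrrq
print(caesar_shift(ciphertext, -key))  # Meet at noon
```

Someone who intercepts the scrambled message without the key sees only `Phhw dw qrrq`; shifting back by the same key recovers the original.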
Common risks to data privacy include data breaches (unauthorized access to company user databases), phishing (fake emails/websites pretending to be legitimate services to steal PII), and tracking by advertisers and social media platforms that collect user behavior data to serve targeted ads.
Exam tip: You will frequently be asked to evaluate tradeoffs between convenience and privacy. For example, using a free map app that tracks your location gives you real-time traffic updates, but it also means the app company may store a long-term record of the places you visit.
6. Common Pitfalls (and how to avoid them)
- Wrong move: Assuming binary is only used to store numbers. Why: Students first learn binary as a number system, so they forget it encodes all digital data types. Correct move: Explicitly state that text, images, audio, and program code are all stored as binary using standardized encoding systems like ASCII and RGB.
- Wrong move: Choosing lossy compression for editable design files or medical images. Why: Students prioritize smaller file size and forget lossy compression causes permanent data loss. Correct move: Always select lossless compression for use cases where you need the exact original file, even if the compressed file is larger.
- Wrong move: Claiming a correlation between two variables proves causation. Why: Students see a clear pattern in a visualization and assume a direct causal link. Correct move: Note that correlation only shows a relationship, and controlled experiments are required to prove causation, as confounding variables may explain the pattern.
- Wrong move: Confusing bits and bytes for file size and internet speed calculations. Why: The terms are similar, and internet speeds are listed in bits per second while file sizes are listed in bytes. Correct move: Remember $1 \text{ byte} = 8 \text{ bits}$, so a 100 Mbps (megabit per second) internet connection downloads a 100 MB (megabyte) file in ~8 seconds, not 1 second.
- Wrong move: Assuming anonymized data can never be linked back to an individual. Why: Students think removing obvious PII like names makes data fully anonymous. Correct move: Note that anonymized data can often be re-identified by combining it with other public datasets, so extra security measures are still required for sensitive datasets.
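The bits-versus-bytes pitfall above comes down to one multiplication and one division, sketched here. The function name is an assumption, and the calculation is idealized: it ignores network overhead and assumes the connection runs at its full rated speed.

```python
BITS_PER_BYTE = 8


def download_seconds(file_megabytes: float, speed_megabits_per_second: float) -> float:
    """Convert the file size to megabits, then divide by the connection speed."""
    file_megabits = file_megabytes * BITS_PER_BYTE
    return file_megabits / speed_megabits_per_second


print(download_seconds(100, 100))  # 8.0
```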
7. Practice Questions (AP Computer Science Principles Style)
Question 1
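Convert the decimal number $109$ to an 8-bit binary number. Show your work.
Solution
- List powers of 2 for 8-bit values, from highest to lowest: $128, 64, 32, 16, 8, 4, 2, 1$
- $128 > 109$: mark 0 for the $2^7$ position
- $64 \le 109$: mark 1 for $2^6$, remainder $109 - 64 = 45$
- $32 \le 45$: mark 1 for $2^5$, remainder $45 - 32 = 13$
- $16 > 13$: mark 0 for $2^4$
- $8 \le 13$: mark 1 for $2^3$, remainder $13 - 8 = 5$
- $4 \le 5$: mark 1 for $2^2$, remainder $5 - 4 = 1$
- $2 > 1$: mark 0 for $2^1$
- $1 \le 1$: mark 1 for $2^0$
Final 8-bit binary value: $01101101_2$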
Question 2
A small business needs to send high-resolution raw product photos to a graphic designer, who will edit the photos and return final compressed versions for use on the company website. Which compression type should the business use to send the original photos to the designer, and why?
- A) Lossy compression, because it will make the file size smaller for faster transfer
- B) Lossy compression, because the quality loss will not be visible on the website
- C) Lossless compression, because the designer needs the original full-quality file to edit
- D) Lossless compression, because it will produce a smaller file than lossy compression
Solution
Correct answer: C
Explanation: Lossy compression permanently removes image data, which would reduce the quality available for the designer to edit. Lossless compression reduces file size without losing any data, so the designer gets the full original quality required for edits. Option A is incorrect because even though lossy compression makes files smaller, the permanent data loss makes it unsuitable for editable source files. Option B is incorrect because while final web images may use lossy compression, the original source files need full quality for editing. Option D is incorrect because lossy compression typically produces smaller files than lossless compression for the same content, not the other way around.
Question 3
A researcher collects data on the number of ice cream cones sold per day and the number of drowning deaths per day in a coastal city, and finds a strong positive correlation between the two variables. Which of the following is the most accurate conclusion?
- A) Eating ice cream causes people to drown
- B) Drowning deaths cause people to buy more ice cream
- C) A third variable, such as daily temperature, explains the correlation
- D) The correlation is a coincidence; there is no relationship between the variables
Solution
Correct answer: C
Explanation: Correlation does not equal causation. On hotter days, more people buy ice cream and more people swim, which leads to more drowning deaths. Temperature is the confounding variable that explains the observed correlation. Options A and B incorrectly assume a direct causal relationship between the two measured variables. Option D is incorrect because there is an indirect relationship between the variables, driven by temperature.
8. Quick Reference Cheatsheet
| Concept | Key Facts |
|---|---|
| Binary Numbers | Base-2 system using 0/1 (bits). 8 bits = 1 byte. All digital data is stored as binary. Binary to decimal: $\sum_{i} b_i \cdot 2^i$ (rightmost bit = $2^0$) |
| Abstraction | Hides low-level binary details to present data in human-usable formats (images, text, apps) |
| Lossless Compression | No data lost, original file fully reconstructible. Use cases: text, medical images, editable files. Formats: ZIP, PNG, FLAC |
| Lossy Compression | Permanent data removal for smaller file sizes, original cannot be restored. Use cases: streaming, JPEG, MP3 |
| Data Visualization | Bar charts (category comparison), line charts (change over time), scatter plots (correlation), pie charts (proportions). Correlation ≠ causation |
| Privacy & Security | PII = Personally Identifiable Information. Protection measures: encryption, 2FA, anonymization. Tradeoffs exist between convenience and privacy |
9. What's Next
The data concepts you learned here are foundational to every other unit in the AP CSP syllabus. When you study programming, you will use variables to store and manipulate binary data to build functional apps. When you study the internet unit, you will learn how compressed data is transmitted between devices across global networks, and how encryption protects that data in transit. When you complete your Create Performance Task, you will use data analysis and visualization to communicate the findings of your original program to graders. Understanding data privacy and security will also help you answer questions about the social impacts of computing, which make up ~20% of the multiple choice exam.
If you have any questions about binary conversion, compression tradeoffs, data visualization best practices, or any other data topic for AP CSP, you can ask Ollie, our AI tutor, for personalized explanations and extra practice problems at any time. You can also find more study guides for other AP CSP topics on the homepage to build your full exam preparation toolkit.