Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add std.zip and support zip files in build.zig.zon #19729

Merged
merged 1 commit into from May 3, 2024

Conversation

marler8997
Copy link
Contributor

@marler8997 marler8997 commented Apr 22, 2024

fixes #17408

Comments Addressed

  • added file verification for security based on yauzl
  • added option for backslash normalization (default is to reject filenames with them, build.zig.zon enables this option)
  • prevent zip-bomb by exiting decompression loop if it gets larger than specified
  • we now assert an error if we detect encryption (instead of misinterpreting encryption section as data)
  • write the temporary zip file to a separate place so we don't have to fixup the extracted files afterwards
  • ability to extract zip from memory
  • able to decompress all zip files under "success" directory in yauzl test directory (except ones with encryption)
  • detect and reject all unexpected data (excludes extension fields)
  • detect and propagate to src/Package/Fetch when there is a single root directory

Improvements:

  • more tests around bad zip files
  • deflate64 (requires std library work to enhance our current deflate implementation to support deflate64)
  • improve std.zip.Diagnostics with nice error messages like std.tar
  • implement more extensions like "override the file name from the Info-ZIP Unicode Path Extra Field (0x7075)" or preserving unix file permissions

@andrewrk
Copy link
Member

Here's a test suite you can use: https://github.com/thejoshwolfe/yauzl/tree/master/test

@thejoshwolfe
Copy link
Sponsor Contributor

  • support zip64 files (anyone know of any I can test?)

The way I do this in yauzl is i manually test with very large zip files (created in a bash for loop or something) to verify that my code does work on actual real archives. but then for the automated test suite, i created an abstract artificial large file where a reader implementation presents data as though it needs zip64 support, but the whole thing is actually very lightweight. https://github.com/thejoshwolfe/yauzl/blob/b15f865c219536693108f13266bbd9d658fd6f66/test/zip64.js

  • probably add some more sanity checks/verification

A couple of suggestions from people opening security issues against yauzl:

  1. Verify file names don't escape the extract directory. You've already checked for unix absolute paths, but there's a few other cases. See https://github.com/thejoshwolfe/yauzl/blob/b15f865c219536693108f13266bbd9d658fd6f66/README.md#validatefilenamefilename (.. segments, windows absolute paths, any \)
  2. But even though you'd want to reject any \ as invalid in a file name, Microsoft .NET framework 4.5.0 until 4.6.1 had a bug where System.IO.Compression.ZipFile would not canonicalize the paths you passed into it, which means we should actually just replace \ with / before doing the validation. In yauzl this behavior isoptionally disabled with the strictFileNames option.
  3. the uncompressed_size checking in decompress() should happen in the loop instead of after it to catch zip-bomb attacks before they do any surprising damage.

Optional enhancements that I've found people actually benefit from:

  1. Check and override the file name from the Info-ZIP Unicode Path Extra Field (0x7075). For context, see support Info-ZIP Unicode Path Extra Field (0x7075) thejoshwolfe/yauzl#33 (comment) . (You'll need to support extra field parsing for zip64 support anyway.)
  2. Fail if entries are encrypted.
  3. Expose the central directory file comment. (Last i checked, GitHub puts the commit sha1 in this field when you download a repo as a zip file.)
  4. Support symlinks somehow. I still don't really know how to do this. (My open issue is this one: add entry.getMode() thejoshwolfe/yauzl#102 (comment) )
  5. I suggest continuing to ignore general purpose bit 11, which is supposed to switch on utf8 encoding instead of cp437 encoding. I've found that InfoZIP always thinks that utf8 mode is on regardless of the bit (considers utf8 to be the "platform default" on linux, the entire concept calling into question the purpose of a portable archive), so i don't think that cp437 is valuable to anyone. I recommend continuing down the "see if anyone cares" route by ignoring the bit rather than what I did, which is to meticulously support it.
  6. (And then you already got right several things that took me many iterations to support correctly, such as actually doing crc32 checking and creating parent directories on demand. Nice work! I recommend those be kept supported, in case anyone was wondering.)
  • handle/test corner case where zip file contains entry that conflicts with tmp directory

Isn't the out_dir already a tmp directory? why not just extract directly into it instead of making another temp directory?

  • support more decompression algorithms?

In 10 years, nobody has opened an issue against yauzl and convinced me that there was a compelling use case for supporting anything other than Deflate (8). I see you've already got Deflate64 (9) in there. Were you able to test that, or is that hypothetical?

@thejoshwolfe
Copy link
Sponsor Contributor

Here's a test suite you can use: https://github.com/thejoshwolfe/yauzl/tree/master/test

There is a medium amount of special case logic in the test harness that makes the success/ and failure/ example .zip files almost straight-forward to test. This is probably the least obvious thing: https://github.com/thejoshwolfe/yauzl/blob/b15f865c219536693108f13266bbd9d658fd6f66/test/test.js#L461-L463

@marler8997
Copy link
Contributor Author

marler8997 commented Apr 22, 2024

Isn't the out_dir already a tmp directory? why not just extract directly into it instead of making another temp directory?

We actually use the tmp directory save the zip contents to a file before extracting it, then I extract everything to a subdirectory to avoid any potential name conflicts with the zip file itself...then we can remove that zip file and collapse that temporary subdiectory (also avoiding name conflicts with the subdirectory while doing this). If anyone has a better way to do this please let me know! :)

out-dir/zip (the temporary zip file)
out-dir/tmp/ (the temporary subdirectory to extract files to)

Then afterwards we do something like rm out-dir/zip && mv out-dir/tmp/* out-dir/, taking care to avoid conflicts with tmp while we are moving it's children up.

@marler8997
Copy link
Contributor Author

marler8997 commented Apr 22, 2024

I see you've already got Deflate64 (9) in there. Were you able to test that, or is that hypothetical?

It seemed to be working but I found a deflate64 example that doesn't work...looks like this would require some extra work in our std module deflate implementation to actually work in general. I've updated the implementation to fail on seeing this format for now. We could make an attempt to process it as non 64 deflate, but seems better to just say we don't support it until we actually do.

@andrewrk
Copy link
Member

andrewrk commented Apr 22, 2024

rm out-dir/zip && mv out-dir/tmp/* out-dir/
If anyone has a better way to do this please let me know! :)

This is N+1 rename syscalls where N is the number of files in the root directory of the zip file, and fails if the zip file has a file named tmp inside of it.

Instead, I suggest to download the zip file to zig-cache/tmp/$RANDOM_FILENAME.zip, and then extract directly into out_dir. Then it is 1 rename syscall regardless of N, and avoids that failure edge case.

As a bonus you could extract the build.zig.zon first, and use its filtering rules to skip extraction of excluded files, but if I remember correctly that will require some larger reworking of the fetch logic that probably should be saved for potential future follow-up work, if done at all. An equivalent optimization potential exists for tar files (in case build.zig.zon was smartly emitted first in the tarball) and git fetching.

@squeek502
Copy link
Collaborator

squeek502 commented Apr 23, 2024

First (very preliminary, just a few minutes of fuzzing) round of fuzzing crashes:

zip-fuzzing-crashes-20240422.zip

From a quick look, it found an integer overflow, and then also NUL being in file paths causing an assertion failure. The crashes can be reproduced with the following code:

const std = @import("std");

test "zip fuzzing repro" {
    const testfile = "id:000000,sig:06,src:000033,time:147,execs:1011,op:havoc,rep:11";

    var tmp = std.testing.tmpDir(.{});
    defer tmp.cleanup();

    var file = try std.fs.cwd().openFile(testfile, .{});
    defer file.close();

    std.zip.extract(tmp.dir, file, .{}) catch {};
}

The fuzzer implementation is in a branch here: https://github.com/squeek502/zig-std-lib-fuzzing/tree/zip

Side note: one thing that would be nice from a fuzzing perspective would be the option to do everything in-memory instead of hitting the filesystem, but not sure how feasible that is (see the tar fuzzer for how the filesystem was able to be avoided with std.tar (but there's also a tar-fs fuzzer that untars to the filesystem)).

@marler8997
Copy link
Contributor Author

marler8997 commented Apr 23, 2024

one thing that would be nice from a fuzzing perspective would be the option to do everything in-memory instead of hitting the filesystem

I just pushed an updated commit with this ability leveraging std.io.SeekableStream.

P.S. oh actually, you can now write/read the zip file in memory, however, the extract function still goes to the filesystem. Avoiding the filesystem altogether could be done with some sort of generic filesystem interface passed to the extract function...I heard there might be something like that in the works? (an io interface of some sort)

@squeek502
Copy link
Collaborator

Looking good! Ran the updated fuzzer for 12 hours and found no crashes.

Note: the fuzzer just checks for illegal behavior, not correctness. Some possible ideas for testing correctness:

  • If there's a zip implementation in C that can be used as an 'oracle' (i.e. we know that we want the Zig implementation to work the same way in ~all cases), then I could write a fuzzer that checks that the outputs of each are the same
  • If there's a compressor implementation that we can throw random inputs at, then we could test that the inputs round trip through the compressor -> the Zig decompressor.

@dweiller
Copy link
Contributor

dweiller commented Apr 25, 2024

Side note: one thing that would be nice from a fuzzing perspective would be the option to do everything in-memory instead of hitting the filesystem, but not sure how feasible that is (see the tar fuzzer for how the filesystem was able to be avoided with std.tar (but there's also a tar-fs fuzzer that untars to the filesystem)).

Are you fuzzing on windows or linux? On linux you can get around this by making a tempfs mount that your fuzzer runs on so you never actually hits the harddrive/ssd.

@marler8997 marler8997 marked this pull request as ready for review April 29, 2024 03:20
fixes ziglang#17408

Helpful reviewers/testers include Joshe Wolfe, Auguste Rame, Andrew
Kelley and Jacob Young.

Co-authored-by: Joel Gustafson <joelg@mit.edu>
@jacobly0 jacobly0 merged commit a96b78c into ziglang:master May 3, 2024
10 checks passed
@marler8997 marler8997 deleted the zip branch May 4, 2024 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ability to fetch from zip files
6 participants