add std.zip and support zip files in build.zig.zon #19729

marler8997 · 2024-04-22T00:07:10Z

fixes #17408

Comments Addressed

added file verification for security based on yauzl
added option for backslash normalization (default is to reject filenames with them, build.zig.zon enables this option)
prevent zip-bomb by exiting decompression loop if it gets larger than specified
we now assert an error if we detect encryption (instead of misinterpreting encryption section as data)
write the temporary zip file to a separate place so we don't have to fixup the extracted files afterwards
ability to extract zip from memory
able to decompress all zip files under "success" directory in yauzl test directory (except ones with encryption)
detect and reject all unexpected data (excludes extension fields)
detect and propagate to src/Package/Fetch when there is a single root directory

Improvements:

more tests around bad zip files
deflate64 (requires std library work to enhance our current deflate implementation to support deflate64)
improve std.zip.Diagnostics with nice error messages like std.tar
implement more extensions like "override the file name from the Info-ZIP Unicode Path Extra Field (0x7075)" or preserving unix file permissions

andrewrk · 2024-04-22T08:50:19Z

Here's a test suite you can use: https://github.com/thejoshwolfe/yauzl/tree/master/test

thejoshwolfe · 2024-04-22T13:15:32Z

support zip64 files (anyone know of any I can test?)

The way I do this in yauzl is i manually test with very large zip files (created in a bash for loop or something) to verify that my code does work on actual real archives. but then for the automated test suite, i created an abstract artificial large file where a reader implementation presents data as though it needs zip64 support, but the whole thing is actually very lightweight. https://github.com/thejoshwolfe/yauzl/blob/b15f865c219536693108f13266bbd9d658fd6f66/test/zip64.js

probably add some more sanity checks/verification

A couple of suggestions from people opening security issues against yauzl:

Verify file names don't escape the extract directory. You've already checked for unix absolute paths, but there's a few other cases. See https://github.com/thejoshwolfe/yauzl/blob/b15f865c219536693108f13266bbd9d658fd6f66/README.md#validatefilenamefilename (.. segments, windows absolute paths, any \)
But even though you'd want to reject any \ as invalid in a file name, Microsoft .NET framework 4.5.0 until 4.6.1 had a bug where System.IO.Compression.ZipFile would not canonicalize the paths you passed into it, which means we should actually just replace \ with / before doing the validation. In yauzl this behavior isoptionally disabled with the strictFileNames option.
the uncompressed_size checking in decompress() should happen in the loop instead of after it to catch zip-bomb attacks before they do any surprising damage.

Optional enhancements that I've found people actually benefit from:

Check and override the file name from the Info-ZIP Unicode Path Extra Field (0x7075). For context, see support Info-ZIP Unicode Path Extra Field (0x7075) thejoshwolfe/yauzl#33 (comment) . (You'll need to support extra field parsing for zip64 support anyway.)
Fail if entries are encrypted.
Expose the central directory file comment. (Last i checked, GitHub puts the commit sha1 in this field when you download a repo as a zip file.)
Support symlinks somehow. I still don't really know how to do this. (My open issue is this one: add entry.getMode() thejoshwolfe/yauzl#102 (comment) )
I suggest continuing to ignore general purpose bit 11, which is supposed to switch on utf8 encoding instead of cp437 encoding. I've found that InfoZIP always thinks that utf8 mode is on regardless of the bit (considers utf8 to be the "platform default" on linux, the entire concept calling into question the purpose of a portable archive), so i don't think that cp437 is valuable to anyone. I recommend continuing down the "see if anyone cares" route by ignoring the bit rather than what I did, which is to meticulously support it.
(And then you already got right several things that took me many iterations to support correctly, such as actually doing crc32 checking and creating parent directories on demand. Nice work! I recommend those be kept supported, in case anyone was wondering.)

handle/test corner case where zip file contains entry that conflicts with tmp directory

Isn't the out_dir already a tmp directory? why not just extract directly into it instead of making another temp directory?

support more decompression algorithms?

In 10 years, nobody has opened an issue against yauzl and convinced me that there was a compelling use case for supporting anything other than Deflate (8). I see you've already got Deflate64 (9) in there. Were you able to test that, or is that hypothetical?

thejoshwolfe · 2024-04-22T13:26:28Z

Here's a test suite you can use: https://github.com/thejoshwolfe/yauzl/tree/master/test

There is a medium amount of special case logic in the test harness that makes the success/ and failure/ example .zip files almost straight-forward to test. This is probably the least obvious thing: https://github.com/thejoshwolfe/yauzl/blob/b15f865c219536693108f13266bbd9d658fd6f66/test/test.js#L461-L463

marler8997 · 2024-04-22T14:09:48Z

Isn't the out_dir already a tmp directory? why not just extract directly into it instead of making another temp directory?

We actually use the tmp directory save the zip contents to a file before extracting it, then I extract everything to a subdirectory to avoid any potential name conflicts with the zip file itself...then we can remove that zip file and collapse that temporary subdiectory (also avoiding name conflicts with the subdirectory while doing this). If anyone has a better way to do this please let me know! :)

out-dir/zip (the temporary zip file)
out-dir/tmp/ (the temporary subdirectory to extract files to)

Then afterwards we do something like rm out-dir/zip && mv out-dir/tmp/* out-dir/, taking care to avoid conflicts with tmp while we are moving it's children up.

marler8997 · 2024-04-22T19:26:38Z

I see you've already got Deflate64 (9) in there. Were you able to test that, or is that hypothetical?

It seemed to be working but I found a deflate64 example that doesn't work...looks like this would require some extra work in our std module deflate implementation to actually work in general. I've updated the implementation to fail on seeing this format for now. We could make an attempt to process it as non 64 deflate, but seems better to just say we don't support it until we actually do.

andrewrk · 2024-04-22T21:28:15Z

rm out-dir/zip && mv out-dir/tmp/* out-dir/
If anyone has a better way to do this please let me know! :)

This is N+1 rename syscalls where N is the number of files in the root directory of the zip file, and fails if the zip file has a file named tmp inside of it.

Instead, I suggest to download the zip file to zig-cache/tmp/$RANDOM_FILENAME.zip, and then extract directly into out_dir. Then it is 1 rename syscall regardless of N, and avoids that failure edge case.

As a bonus you could extract the build.zig.zon first, and use its filtering rules to skip extraction of excluded files, but if I remember correctly that will require some larger reworking of the fetch logic that probably should be saved for potential future follow-up work, if done at all. An equivalent optimization potential exists for tar files (in case build.zig.zon was smartly emitted first in the tarball) and git fetching.

squeek502 · 2024-04-23T07:06:21Z

First (very preliminary, just a few minutes of fuzzing) round of fuzzing crashes:

zip-fuzzing-crashes-20240422.zip

From a quick look, it found an integer overflow, and then also NUL being in file paths causing an assertion failure. The crashes can be reproduced with the following code:

const std = @import("std");

test "zip fuzzing repro" {
    const testfile = "id:000000,sig:06,src:000033,time:147,execs:1011,op:havoc,rep:11";

    var tmp = std.testing.tmpDir(.{});
    defer tmp.cleanup();

    var file = try std.fs.cwd().openFile(testfile, .{});
    defer file.close();

    std.zip.extract(tmp.dir, file, .{}) catch {};
}

The fuzzer implementation is in a branch here: https://github.com/squeek502/zig-std-lib-fuzzing/tree/zip

Side note: one thing that would be nice from a fuzzing perspective would be the option to do everything in-memory instead of hitting the filesystem, but not sure how feasible that is (see the tar fuzzer for how the filesystem was able to be avoided with std.tar (but there's also a tar-fs fuzzer that untars to the filesystem)).

marler8997 · 2024-04-23T15:23:48Z

one thing that would be nice from a fuzzing perspective would be the option to do everything in-memory instead of hitting the filesystem

I just pushed an updated commit with this ability leveraging std.io.SeekableStream.

P.S. oh actually, you can now write/read the zip file in memory, however, the extract function still goes to the filesystem. Avoiding the filesystem altogether could be done with some sort of generic filesystem interface passed to the extract function...I heard there might be something like that in the works? (an io interface of some sort)

squeek502 · 2024-04-24T21:14:57Z

Looking good! Ran the updated fuzzer for 12 hours and found no crashes.

Note: the fuzzer just checks for illegal behavior, not correctness. Some possible ideas for testing correctness:

If there's a zip implementation in C that can be used as an 'oracle' (i.e. we know that we want the Zig implementation to work the same way in ~all cases), then I could write a fuzzer that checks that the outputs of each are the same
If there's a compressor implementation that we can throw random inputs at, then we could test that the inputs round trip through the compressor -> the Zig decompressor.

dweiller · 2024-04-25T07:38:58Z

Side note: one thing that would be nice from a fuzzing perspective would be the option to do everything in-memory instead of hitting the filesystem, but not sure how feasible that is (see the tar fuzzer for how the filesystem was able to be avoided with std.tar (but there's also a tar-fs fuzzer that untars to the filesystem)).

Are you fuzzing on windows or linux? On linux you can get around this by making a tempfs mount that your fuzzer runs on so you never actually hits the harddrive/ssd.

fixes ziglang#17408 Helpful reviewers/testers include Joshe Wolfe, Auguste Rame, Andrew Kelley and Jacob Young. Co-authored-by: Joel Gustafson <joelg@mit.edu>

marler8997 force-pushed the zip branch 2 times, most recently from 2709542 to 52252a3 Compare April 22, 2024 04:39

andrewrk requested a review from thejoshwolfe April 22, 2024 08:46

marler8997 force-pushed the zip branch from 52252a3 to 1b9f725 Compare April 22, 2024 13:59

marler8997 force-pushed the zip branch 3 times, most recently from 1cd6f42 to 7f40093 Compare April 22, 2024 18:57

marler8997 force-pushed the zip branch from 7f40093 to 8b517ce Compare April 22, 2024 19:28

marler8997 force-pushed the zip branch 4 times, most recently from ab0ad5f to 3d3ea7e Compare April 23, 2024 04:35

marler8997 force-pushed the zip branch from 3d3ea7e to 883c589 Compare April 23, 2024 07:26

tauoverpi mentioned this pull request Apr 23, 2024

zig fetch error: unrecognized 'Content-Type' header: 'application/zip' #19748

Closed

marler8997 force-pushed the zip branch from 883c589 to b6f68f1 Compare April 23, 2024 15:22

marler8997 force-pushed the zip branch from b6f68f1 to 74f70ea Compare April 23, 2024 16:34

marler8997 force-pushed the zip branch 4 times, most recently from 7d359b6 to c9d8073 Compare April 28, 2024 21:50

marler8997 force-pushed the zip branch 2 times, most recently from 1aa7dca to 46c095d Compare April 29, 2024 03:20

marler8997 marked this pull request as ready for review April 29, 2024 03:20

marler8997 force-pushed the zip branch 2 times, most recently from 2b1ccc1 to 08b8065 Compare April 30, 2024 20:57

add std.zip and support zip files in build.zig.zon

55d9e56

fixes ziglang#17408 Helpful reviewers/testers include Joshe Wolfe, Auguste Rame, Andrew Kelley and Jacob Young. Co-authored-by: Joel Gustafson <joelg@mit.edu>

marler8997 force-pushed the zip branch from 08b8065 to 55d9e56 Compare May 2, 2024 10:12

jacobly0 merged commit a96b78c into ziglang:master May 3, 2024
10 checks passed

marler8997 deleted the zip branch May 4, 2024 15:17

marler8997 mentioned this pull request May 4, 2024

update zip marler8997/zigup#123

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add std.zip and support zip files in build.zig.zon #19729

add std.zip and support zip files in build.zig.zon #19729

marler8997 commented Apr 22, 2024 •

edited

andrewrk commented Apr 22, 2024

thejoshwolfe commented Apr 22, 2024

thejoshwolfe commented Apr 22, 2024

marler8997 commented Apr 22, 2024 •

edited

marler8997 commented Apr 22, 2024 •

edited

andrewrk commented Apr 22, 2024 •

edited

squeek502 commented Apr 23, 2024 •

edited

marler8997 commented Apr 23, 2024 •

edited

squeek502 commented Apr 24, 2024

dweiller commented Apr 25, 2024 •

edited

add std.zip and support zip files in build.zig.zon #19729

add std.zip and support zip files in build.zig.zon #19729

Conversation

marler8997 commented Apr 22, 2024 • edited

andrewrk commented Apr 22, 2024

thejoshwolfe commented Apr 22, 2024

thejoshwolfe commented Apr 22, 2024

marler8997 commented Apr 22, 2024 • edited

marler8997 commented Apr 22, 2024 • edited

andrewrk commented Apr 22, 2024 • edited

squeek502 commented Apr 23, 2024 • edited

marler8997 commented Apr 23, 2024 • edited

squeek502 commented Apr 24, 2024

dweiller commented Apr 25, 2024 • edited

marler8997 commented Apr 22, 2024 •

edited

marler8997 commented Apr 22, 2024 •

edited

marler8997 commented Apr 22, 2024 •

edited

andrewrk commented Apr 22, 2024 •

edited

squeek502 commented Apr 23, 2024 •

edited

marler8997 commented Apr 23, 2024 •

edited

dweiller commented Apr 25, 2024 •

edited