-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
json unfolding: parse incrementally instead of using cheshire #42638
Conversation
this lowers memory usage 100x for 50kb strings (our previous limit) and speeds up parsing by an order of magnitude
|
If you're not worried about representing empty arrays, then you can just represent them implicitly by having index integers in the paths of their values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add some guard rails to the case macro, but don't want to hold this up.
JsonParser$NumberType/INT Integer | ||
JsonParser$NumberType/LONG Long | ||
JsonParser$NumberType/FLOAT Float | ||
JsonParser$NumberType/DOUBLE Double | ||
JsonParser$NumberType/BIG_DECIMAL BigDecimal | ||
JsonParser$NumberType/BIG_INTEGER clojure.lang.BigInt)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we care about exposing Integer
and Float
? We could be opinionated like Clojure and just promote them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, maybe; I've tried to repeat whatever behavior we had in place and was just eager to finally get PR up. 😁 I'll look into making this simpler.
src/metabase/util.cljc
Outdated
~@(concat | ||
(mapcat (fn [[test result]] | ||
#_ {:clj-kondo/ignore [:discouraged-var]} | ||
[(eval (enum-ordinal test)) result]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn this is a pity. Wonder whether we can't use reflection on the class instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably also ensure that all the dispatch enums and the input enum have the same type.
Probably jumping the shark a bit, but it would be nice if we could effectively generate a Java switch statement - especially cool if we enforce exhaustiveness too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah generating switch would be cool, but absolutely no idea how to get there. As for dropping eval - if you have any ideas, I'm all ears. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some checks/tests
;; we could be more precise here and issue warning about nested fields (the one in `describe-json-fields`), | ||
;; but this limit could be hit by multiple json fields rather than only by this one. So for the sake of | ||
;; issuing only a single warning in logs we'll spill over limit by a single entry (instead of doing `<=`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a some subtle stuff - I had to read it twice to get the gist and, this is a bit embarrassing, I still don't see how we can hit it multiple times. Don't we stop reading tokens and then pop off the entire loop's tail call stack?
I think it's worth explicitly saying that you're leaving the item to trigger the warning later, and that the last1 item will be lopped off again then too.
It's a bit hacky, but I think you could avoid this trick by only logging if res
is still a transient, although that means reflection. Alternatively you could use another loop parameter as a latch.
Footnotes
-
Well, a random item will be removed. A pity that we forget the order by then. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't see how we can hit it multiple times
If there are a few fields with JSON, even if you got them under the limit - they could be over 100 keys in total. So not in this function directly, I'll probably add some more words here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha. I hadn't looked much at the surrounding code.
(let [data (into {} (for [i (range (* 2 @#'sql-jdbc.describe-table/max-nested-field-columns))] | ||
[(str "key" i) i]))] | ||
;; +2 to limit since we go 1 over the limit, see comment in `json->types` | ||
(is (> (+ 2 @#'sql-jdbc.describe-table/max-nested-field-columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not test exactly what size it is? I think it's fine to update the test if we fix the "1 over" thing later.
test/metabase/util_test.cljc
Outdated
(is (not (case Month/MAY | ||
Month/APRIL false | ||
Month/MAY true | ||
false)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd prefer this test with distinct values for all three branches so it's clearer.
(is (not (case Month/MAY | |
Month/APRIL false | |
Month/MAY true | |
false)))) | |
(is (= 3 (case Month/MAY | |
Month/APRIL 1 | |
Month/MAY 2 | |
3)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvements, double ticked
src/metabase/util.cljc
Outdated
"Like `case`, but explicitly dispatch on Java enum ordinals." | ||
[value & clauses] | ||
#_ {:clj-kondo/ignore [:discouraged-var]} | ||
(let [types (map (comp type eval first) (partition 2 clauses))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there another way to look up the type from a symbol? 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this 🤢
(let [sym 'JsonToken/VALUE_STRING]
(clojure.lang.Reflector/invokeStaticMethod (resolve (symbol (namespace sym)))
"valueOf"
(to-array [(name sym)])))
Still better than eval though, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(defn sym->enum [s] 🤮)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as they say, be wrong on the internet and learn 😁
Month/MAY true | ||
false))) | ||
(testing "checks for type" | ||
(is (thrown? Exception #"`case-enum` only works.*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great that we check this, but it still leaves the slightly more likely bug which is having a mismatch just in runtime value we take, e.g.
(u/case-enum DayOfWeek/SUNDAY
Month/JANUARY 1)
We can't protect against this without reflection or a more advanced technique, so I think this is fine, but perhaps worth a test documenting the sad as well.
src/metabase/util.cljc
Outdated
~@(concat | ||
(mapcat (fn [[test result]] | ||
#_ {:clj-kondo/ignore [:discouraged-var]} | ||
[(eval (enum-ordinal test)) result]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think this can then be
[(.ordinal (sym->enum test)) result])
PS: do you know how to make inline code fragments use Clojure syntax highlighting?
|
||
Passing the same enum type as the ones you're checking in is on you, this is not checked." | ||
[value & clauses] | ||
(let [types (map (comp type sym->enum first) (partition 2 clauses))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You could calculate the type a bit more directly with just (comp resolve symbol namespace first)
@piranha Did you forget to add a milestone to the issue for this PR? When and where should I add a milestone? |
this lowers memory usage 100x for 50kb strings (our limit before this change) and speeds up parsing by an order of magnitude. For example, results for parsing 42kb string (already in memory): Before: Time per call: 6.32 ms Alloc per call: 8,732,850b After: Time per call: 55.16 us Alloc per call: 83,254b
#42682) this lowers memory usage 100x for 50kb strings (our limit before this change) and speeds up parsing by an order of magnitude. For example, results for parsing 42kb string (already in memory): Before: Time per call: 6.32 ms Alloc per call: 8,732,850b After: Time per call: 55.16 us Alloc per call: 83,254b
This lowers memory and time usage 100x for 50kb strings (our previous limit). For example, those are results for parsing 42kb JSON string (already in memory):
Nuances:
Resolves #34798